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Accomplishments 


In  this  grant  we  addressed  issues  related  to  the  two  major  components  of  our  proposed 
research  for  Space-Time  Adaptive  Processing  (STAP)  applications,  namely  (i)  the  development 
of  advanced  algorithms  for  detection  and  parameter  estimation  of  weak  targets  in  the  presence 
of  jamming  and  radar  clutter,  and  (ii)  the  mapping  of  the  algorithms  into  massively  parallel 
architectures  for  their  high-speed  implementation.  Our  accomplishments  during  this  grant  may 
be  summarized  as  follows: 


Advances  in  Detection  and  Estimation  for  STAP  Applications 
with  Joint  Gaussian  Statistics 
Investigator:  Professor  Irving  S.  Reed 

1.  The  complete  theoretical  development  of  the  Cross-Spectral  Method,  an  optimal  technique 
for  reduced-rank  STAP.  The  performance  measure  for  optimality  is  maximizing  the  output 
signal-to-interference  plus  noise  ratio,  which  is  equivalent  to  minimizing  the  mean-square 
error  of  the  processor.  A  proof  of  optimality  has  been  obtained  and,  while  counter-intuitive, 
the  resulting  solution  is  shown  to  be  different  from  that  yielded  by  the  principal  component 
approximation  of  the  array  covariance  matrix.  The  rank  of  the  resulting  weight  vector 
is  reduced  by  the  cross-spectral  method,  in  contrast  to  the  principal  component  inverse 
(PCI) /Eigencanceler  methods.  Finally,  a  low  complexity  filtering-based  implementation  of 
the  cross-spectral  method  has  been  investigated; 

2.  The  design  of  a  simultaneous  CFAR  detection  and  maximum  likelihood  (ML)  estimation 
STAP  algorithm  for  airborne  radar.  Generalization  of  our  previous  results  on  element-space 
post-Doppler  STAP  to  fully  adaptive,  beam-space  post-Doppler  processors.  This  study 
demonstrates  that  the  finer  resolution  leads  to  a  reduction  in  the  detection  capability.  This 
implies  that  one  may  trade-off  the  accuracy  of  ML  estimation  with  the  performance  of  the 
CFAR  detection  criterion; 

3.  The  development  of  a  fast  constant  false  alarm  (CFAR)  detection  algorithm  for  STAP.  This 
approach  is  based  on  the  eigen-decomposition  of  the  space-time  radar  covariance  matrix. 
The  method  utilized  by  this  algorithm  does  not  require  a  matrix  inversion,  thereby  reducing 
the  computational  complexity  requirements  of  the  processor. 

Robust  Detection  and  Estimation  with  Alpha-Stable 
Distributions  for  STAP  Applications 
Investigator:  Professor  Chrysostomos  L.  Nikias 

1.  The  characterization  and  development  of  methods  for  parameter  estimation  of  symmetric 
alpha-stable  {SaS)  distributions  by  means  of  fractional  lower-  and  negative-order  moments. 
The  modeling  of  the  amplitude  statistics  of  radar  clutter  by  means  of  SaS  distributions 
and  the  estimation  of  the  parameters  of  the  stable  distributions  from  real  clutter  of  the 
Mountain  Top  Database; 
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2.  The  development  of  an  adaptive  matched-filter  detector  for  the  case  of  a  signal  embedded  in 
heavy-tailed  noise  modeled  as  a  sub-Gaussian,  alpha-stable  process.  The  implementation  of 
a  generalized  likelihood  ratio  detector  which  employs  robust  estimates  of  the  unknown  noise 
underlying  matrix  and  the  unknown  signal  strength.  The  comparison  of  the  performance 
of  the  new  detector  to  the  performance  of  the  optimum  detector  under  completely  known 
signal  and  Gaussian  noise  characteristics; 

3.  The  theoretical  analysis  on  Cramer-Rao  bounds  for  target  angle  and  Doppler  estimation  for 
airborne  radar  operating  in  interference  modeled  as  a  Cauchy  process.  The  development  of 
optimal,  maximum  likelihood-based  approaches  to  the  estimation  problem  and  the  intro¬ 
duction  of  the  Cauchy  Beamformer  which  exhibits  robust  performance  in  a  wide  range  of 
impulsive  noise  environments.  The  development  of  new  joint  spatial-  and  Doppler-frequency 
high-resolution  estimation  techniques  based  on  the  eigendecomposition  of  the  covariation 
matrix  of  the  space-time  radar  measurements. 


Scalable  Portable  Parallel  Algorithms  for  STAP  Applications 
Investigator:  Professor  Viktor  K.  Prasanna 


1.  The  development  of  efficient  algorithms  for  parallel  Higher  Order  Post-Doppler  Process¬ 
ing.  Our  algorithms,  which  exploit  remapping  data  between  various  computation  steps  and 
scheduling  the  computations  to  overlap  communication,  lead  to  scalable  parallel  algorithms. 
The  previous  approach  employs  row  wrap  mapping  scheme  to  avoid  data  remapping.  How¬ 
ever,  the  communication  time  increases  as  the  number  of  processors  increases.  Our  scalabil¬ 
ity  analysis  and  experimental  results  show  that  the  frequent  short  message  communications 
of  the  earlier  fine  grain  algorithms  can  be  eliminated  by  data  remapping; 

2.  The  development  of  parallel  weight  computation  steps  of  Element  Space  Pre-Doppler  and 
Element  Space  Post-Doppler  STAP,  which  are  the  major  computational  bottlenecks  in  the 
entire  processing.  The  tasks  are  partitioned  evenly.  The  parallelism  is  achieved  by  dis¬ 
tributing  whole  tasks  among  the  processing  elements.  The  reason  for  employing  task  level 
parallelism  is  that  the  problem  sizes  of  the  tasks  in  Element  Space  STAP  are  too  small  to 
parallelize  efficiently.  Despite  advances  in  network  technology,  state-of-the-art  HPC  plat¬ 
forms  still  consume  longer  time  in  communicating  than  in  computing  a  unit  data.  Hence, 
coarse  grain  parallelism  is  more  suitable  for  these  platforms.  Thus,  eliminating  sparse  short 
communications  by  providing  data  required  in  each  task  based  on  owner  compute  rule,  can 
achieve  much  greater  speed-up; 

3.  The  development  of  efficient  data  distribution  algorithms  which  exploit  intermediate  data 
remapping  schemes  and  communication  contexts.  The  proposed  scheme  is  a  two  phase  ap¬ 
proach  which  exploits  combine-and-forward  techniques.  By  using  intermediate  data  map¬ 
ping,  the  introduced  algorithms  effectively  reduce  startup  overhead  and  node  contention. 
Our  algorithms  also  utilize  pipelined  communication  scheduling. 
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Part  1 


Advances  in  Detection  and 
Estimation  for  STAP  Applications 
with  Joint  Gaussian  Statistics 


Principal  Investigator:  Professor  Irving  S.  Reed 

Collaborators:  Jay  Scott  Goldstein  and  Yow-Ling  Gau 

This  part  includes  results  on  optimal  techniques  with  joint  Gaussian  statistics  for  reduced-rank 
STAP  applications,  CFAR  detection,  and  maximum  likelihood  estimation.  First,  in  Section  1.1, 
the  complete  theoretical  development  of  the  optimal  technique  for  reduced-rank  STAP,  termed 
the  cross-spectral  method,  has  been  accomplished.  The  performance  measure  for  optimality  is 
maximizing  the  output  signal-to-interference  plus  noise  ratio,  which  is  equivalent  to  minimizing 
the  mean-square  error  of  the  processor.  A  proof  of  optimality  has  been  obtained  and,  while 
counter-intuitive,  the  resulting  solution  is  shown  to  be  different  from  that  yielded  by  the  principal 
component  approximation  of  the  array  covariance  matrix.  The  rank  of  the  resulting  weight  vector 
is  reduced  by  the  cross-spectral  method,  in  contrast  to  the  PCI/Eigencanceler  methods.  Finally,  a 
low  complexity  filtering-based  implementation  of  the  cross-spectral  method  has  been  investigated, 
which  provides  excellent  performance  while  alleviating  the  requirements  for  covariance  matrix 
estimation  and  eigen-decomposition. 

In  Section  1.2,  we  design  a  simultaneous  CFAR  detection  and  ML  estimation  STAP  algorithm 
for  airborne  radar.  Our  previous  results  on  element-space  post-Doppler  STAP  have  been  gen¬ 
eralized  and  applied  to  the  fully  adaptive  and  beam-space  post-Doppler  processors.  This  study 
demonstrates  that  the  finer  resolution  leads  to  a  reduction  in  the  detection  capability.  This  im¬ 
plies  that  one  may  trade-off  the  accuracy  of  ML  estimation  with  the  performance  of  the  CFAR 
detection  criterion. 

Finally,  in  Section  1.3,  we  develop  a  fast  CFAR  detection  algorithm  for  STAP  applications. 
This  approach  is  based  on  the  eigen-decomposition  of  the  space-time  radar  covariance  matrix. 
The  method  utilized  by  this  algorithm  does  not  require  a  matrix  inversion,  thereby  reducing  the 
computational  complexity  requirements  of  the  processor. 
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1.1  The  Cross-Spectral  Metric  and  Optimal  Rank  Reduction 

The  problem  of  optimal  detection  is  closely  related  to  constrained  Wiener  filtering  [1,  2,  3,  4,  5,  6]. 
In  fact,  the  optimal  detection  criterion  relies  upon  steering  a  beam  in  the  direction  and  Doppler 
of  interest,  while  nulling  out  the  effects  of  interference  and  clutter  in  that  beam.  In  this  sense, 
the  canonical  form  of  the  optimal  detector  takes  the  form  of  a  partitioned  processor  termed  the 
Generalized  Sidelobe  Canceller  (GSC).  The  GSC  was  introduced  by  Applebaum  and  Chapman 
[7]  for  narrowband  signals  and  extended  to  the  space-time  problem  by  Griffiths  [8].  In  this 
section,  the  theoretical  development  of  optimal  rank  reduction  for  space-time  adaptive  processing 
is  achieved.  The  cross-spectral  metric  [9,  10,  11,  12]  is  derived  as  the  optimal  technique  to  reduce 
the  rank  of  the  weight  vector  while  maximizing  the  output  signal-to-interference  plus  noise  ratio 
(SINR) .  The  methodology  for  achieving  this  optimization  is  to  minimize  the  mean-square  error 
of  the  GSC  as  a  function  of  the  weight  vector  rank.  Minimizing  the  mean-square  error  and 
maximizing  the  output  SINR  for  the  GSC  processor  have  been  shown  to  be  equivalent  [6].  Indeed, 
it  is  easily  shown  that  the  weight  vector  which  maximizes  the  SINR  for  a  constrained  processor 
and  the  weight  vector  that  minimizes  the  mean-square  error  are  identical. 

It  is  important  to  note  that  other  reduced-rank  STAR  techniques  which  are  based  on  the 
principal  component  technique,  such  as  the  principal  component  inverse  (PCI)  method  [13,  14] 
(also  recently  termed  an  Eigencanceler  [15])  only  form  a  low  rank  estimate  of  the  array  covari¬ 
ance  matrix.  That  is  to  say,  while  the  covariance  matrix  is  approximated  by  only  the  principal 
eigenvectors  and  their  corresponding  eigenvalues,  the  resulting  weight  vector  is  still  full-rank. 
The  approach  described  herein  differs  from  all  other  approaches  in  three  significant  ways:  (1): 
the  principal  component  technique  provides  the  best  low-rank  approximation  to  the  full  rank 
covariance  matrix  [16].  However,  this  is  not  the  correct  performance  measure.  The  correct  per¬ 
formance  measure  is  to  find  the  best  low-rank  approximation  of  the  covariance  matrix  (i.e.:  the 
best  reduced-rank  subspace)  for  maximizing  the  output  SINR.  While  counter-intuitive,  these 
two  subspaces  are  not  the  same.  (2)  The  actual  number  of  adaptive  weights  required  by  the  PCI 
technique  is  identical  to  that  required  in  the  full-rank  case.  The  cross-spectral  technique  results 
in  a  weight  vector  whose  size  corresponds  to  the  reduced  dimension.  This  means  that  using  the 
cross-spectral  approach  to  reduce  the  rank  of  the  problem  from  some  large  number  N  to  some 
small  .number  M  yields  a  weight  vector  of  dimension  M.  Since  the  computational  complexity 
and  required  hardware  increases  with  the  rank  of  the  weight  vector,  this  issue  is  paramount.  (3) 
Finally,  the  development  of  the  cross-spectral  metric  naturally  lends  itself  to  approximations  us¬ 
ing  any  unitary  transform  to  form  a  basis  for  the  space  spanned  by  the  array  covariance  matrix. 
Thus,  the  eigenvectors  need  not  be  used  and  the  eigen-decomposition  of  the  array  covariance 
matrix  need  not  be  performed.  Further,  since  it  is  possible  to  realize  a  unitary  operator  via  fast 
filtering  algorithms,  such  as  the  DFT  and  DCT,  one  may  evaluate  the  cross-spectral  metric  and 
perform  rank  reduction  without  ever  forming  the  array  covariance  matrix  estimate  from  the  data. 
This  in  turn  implies  that  the  rank  reduction  may  be  performed  via  a  criterion  which  is  based 
only  on  simple  filtering  operations. 

1.1.1  Space-Time  Adaptive  Processing 

Radar  returns  are  collected  in  coherent  processing  intervals  (CPI)  which  may  be  represented  as  a 
3-D  data  cube  as  shown  in  Figure  1.1.  The  data  is  then  processed  at  one  range  of  interest,  which 
corresponds  to  a  slice  of  the  CPI  data  cube.  This  slice  is  a  J  x  K  space-time  snapshot,  denoted 
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Figure  1.1:  3-Dimensional  CPI  Data  Cube 

X.  The  space-time  snapshot  can  be  expressed  as  a  matrix 

2^0,0  a-’o,! 

a:i,o  2:1,1 

X  = 

xj-ifi  2:j_i,i 

where  the  individual  elements  Xj^k  correspond  to  the  data  from  the  j-th  pulse  repetition  interval 
(PRI)  and  the  k-th.  sensor  element  [17,  18]. 

The  2-D  space-time  data  structure  consists  of  element  space  information  and  PRI  space 
Doppler  information.  Each  return  X  is  composed  of  components  due  to  the  target,  the  interference 
sources  or  jammers,  clutter,  and  white  noise: 

X  =  Xt  +  Xi  +  Xe  +  X^.  (1.2) 

The  noise  in  the  space-time  snapshot  is  the  thermal  noise  present  at  the  sensor  elements,  which 
is  spatially  and  temporally  white.  Each  of  the  other  components  of  the  radar  return  are  now 
addressed. 

The  target  is  assumed  to  be  a  point  target,  present  in  only  one  range  bin.  The  J  x  K 
space-time  snapshot  corresponding  to  the  target  is  then  given  by 

Xt  =  ath[wt)a['dt)^ ,  (1.3) 

where  a('!9f)  is  the  K  x  1  spatial  steering  vector,  b(a;t)  is  the  J  x  1  temporal  steering  vector,  and 
at  is  the  amplitude  of  the  target. 

The  interference  environment  consists  of  several  sources  which  are  assumed  to  be  temporally 
white  (barrage  jammers  spread  over  all  Doppler  frequencies  at  a  particular  azimuth).  The  J  x  K 
space-time  snapshot  corresponding  to  the  jammers  is  then  given  by 

X,  =  (1.4) 


2:0,  A'- 1 
2:i,A'-i 

2:j-i,A'-i 


(1.1) 
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where  oci  is  the  J  x  I  matrix  of  jammer  amplitude  for  the  J  PRIs  of  each  of  the  I  interference 
sources  and  A{'0i)  is  the  K  x  I  matrix  whose  row  are  the  spatial  steering  vectors  for  each 
interference  source. 

The  clutter  returns  when  considering  a  target  at  a  given  range  will  come  from  all  areas  within 
a  common  range  about  the  radar  platform.  Consider  returns  from  a  circular  ring  about  the 
platform.  If  each  spot  on  the  clutter  ring  is  considered  to  be  a  point  target,  its  space-time 
snapshot  will  take  the  form 

Xcp  =  Q'cpb(wcp)a(??cp)^,  (1.5) 

where  Wcp  and  the  spatial  and  Doppler  frequencies  of  the  clutter  return  from  the  p-th 

patch,  and  a^p  is  the  power  from  the  p-th  clutter  patch.  The  clutter  is  assumed  to  be  stationary 
and  it  has  a  velocity  relative  to  the  platform  housing  the  radar  which  is  determined  by  the  azimuth 
angle  between  the  array  and  the  clutter  patch.  For  simulation  purposes,  the  clutter  returns  are 
assumed  to  come  from  the  entire  clutter  ring,  which  is  divided  into  360  evenly  spaced  sectors, 
and  summed  so  that 

360 

Xc  =  X)  ®cpb(/3i9cp)a(i?cp)^,  (1.6) 

^cp  =  \ 

where  fS  is  the  constant  of  linearity  between  the  Doppler  frequency  and  the  spatial  frequency  and 

i?cp  =  •^cos^cSind>cp,  (1.7) 

with  d  being  the  element  spacing,  Xq  being  the  wavelength,  and  6c  being  the  fixed  elevation  angle. 

Each  of  the  components  of  the  space-time  snapshots  are  assumed  to  be  uncorrelated.  These 
component  matrices  are  stacked  column-wise  to  form  the  KJ  x  1  vector  x  =  x*  -f  x;  Xc  -|-  Xu,. 
The  resulting  space-time  covariance  matrix  is  the  KJ  X  KJ  matrix  Ra,: 

Ra;  =  E[xx^]  =  R( -b  Rj -1- Rc -I- Rtu,  (1.8) 

where  R^,  R,,  Rc,  and  R^,  are  the  target  signal,  interference,  clutter,  and  white  noise  covariance 
matrices,  respectively. 

1.1.2  The  GSC  Form  Processor 

The  structure  of  the  processor  under  consideration  is  the  GSC  partitioned  form  processor.  The 
received  snapshot  is  of  dimension  KJ,  which  is  the  product  of  the  number  of  sensor  elements  and 
the  number  of  Doppler  frequency  bins.  The  GSC  is  implemented  by  partitioning  the  received  data 
with  filters  Wc  and  Wj.  The  conventional  beamforming  filter  Wc  is  a  vector  which  enforces  the 
look-direction  constraint  in  angle  and  Doppler.  This  desired  signal  is  blocked  from  the  adaptive 
processor  through  the  signal  blocking  matrix  W^,  which  in  general  is  of  dimension  N  xKJ  where 
N  <  KJ.  For  example,  in  the  single  linear  constraint  case,  N  =  KJ  —  1.  The  full  row  rank 
matrix  is  composed  of  rows  a*  such  that  ajWc  =  0  for  f  =  1, 2, . .  .KJ. 

As  previously  described,  the  KJx  1-dimensional  received  signal  vector  present  on  the  antenna 
elements  at  time  k  is  denoted  by  x(A;)  and  the  associated  KJ  x  KJ  received  input  data  covariance 
matrix  is  denoted  by  R^,  =  E[x(fc)x'^(A;)].  The  A’-dimensional  noise  subspace  data  vector  'x.s{k), 
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Figure  1.2:  The  full-rank  GSC  MVDR  processor. 


the  scalar  beamformed  output  d{k)^  and  the  scalar  beamformed  noise  estimator  ys{k)  are  given 

by 


yis{k) 

=  W,x(A;), 

d{k) 

=  wfx(A;), 

(1.9) 

ys{k) 

=  w^Xs(A;), 

where  w  is  the  A^-dimensional  weight  vector.  Finally,  the  array  output  is 

y[k)  =  (wf  -  x(A:).  (I-IO) 

Evidently,  the  mean-square  error  of  the  processor  is  given  by  the  mean-square  value  of  y{k). 

The  GSC  array  in  Figure  1.2  converges  to  the  discrete-time  Wiener-Hopf  solution  given  by 

w  =  (1.11) 

where  the  observation  data  covariance  matrix  is  expressed  as 

R,,  =  E[x,(fc)xf  (A:)]  =  W,R,Wf ,  (1.12) 

and  the  cross-correlation  vector  between  the  noise  subspace  data  vector  and  the  beamformer 
output  is  given  by 

=  B[^s{k)d*{k)]  =  W,R,w,.  (1.13) 
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Figure  1.3:  The  full-rank  GSC  MVDR  processor  in  principal  coordinates. 

The  minimum  mean-square  error,  denoted  P,  is  found  by  substituting  (1.11)  into  (1.10)  and 
evaluating  the  mean-square  value  of  y{k): 

P  =  E[|y(A:)p]  =  aj  -  (1-14) 

where  (t|  —  w^Ra:a:Wc  is  the  variance  of  the  conventional  beamformer  output. 

The  observation  data  covariance  matrix  may  be  expressed  in  terms  of  its  eigenvectors 
and  eigenvalues  as 

R,,  =  UAU^,  (1.15) 

where  U  is  a  unitary  N  X  N  matrix  composed  of  the  eigenvectors  ^  diagonal 

matrix  of  associated  eigenvalues  In  terms  of  the  principal  coordinates  of  the  problem,  we 

define  the  process  p(A;)  =  U'^Xs(A;).  A  normal  component  covariance  matrix  Rp,  cross-correlation 
vector  Tpd^  and  Wiener  filter  are  defined  now  as  follows: 

Rp  =  E[p(A:)p^(A:)]  =  U^R,,U  =  A, 

rp,  =  E[p{k)d*{k)]  =  V^r,,d,  (1-16) 

=  R“^rpd  =  U^w. 

The  GSC  in  these  normal  coordinates,  depicted  in  Figure  1.3,  is  equivalent  to  the  GSC  in  Figure 
1.2  in  terms  of  its  steady-state  characteristics.  The  array  output  of  the  GSC  in  normal  coordinates 
is  given  by 

y  =  (wf  -  w^U^W,)  x{k)  =  (wf  -  w^W,)  x(A;) .  (1.17) 

Note  that  the  MMSE, 

P  =  (^d-  =  ^d-  ^xsd^zl^xsd^  (1-18) 

is  conserved  by  any  unitary  transformation,  including  that  realized  by  the  operator  U. 
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Figure  1.4:  The  reduced-rank  GSC  MVDR  processor. 

1.1.3  Partially  Adaptive  Processing 

The  problem  of  reducing  the  degrees  of  freedom  for  an  array  processor  involves  selecting  a  subset 
or  some  combination  of  the  elements  to  be  adaptively  weighted.  For  notational  purposes,  we  let 
the  space  spanned  by  the  columns  of  the  fully  adaptive  array  covariance  matrix  be  denoted  by 
,  implying  that  the  observation  covariance  matrices  and  Rp  are  of  dimension  N  x  N  and 
the  vectors  w,  and  are  N  x  1-dimensional  vectors.  The  partially  adaptive  GSC 

shown  in  Figure  1.4  utilizes  axi  N  x  M  transformation  operator  U,  in  place  of  U  in  Figure  1.3, 
to  form  the  M-dimensional  reduced-rank  observation  data  vector 

z{k)  =U^Xs{k),  (1-19) 

where  M  <  N.  The  associated  M  x  M  reduced-rank  covariance  matrix  is  given  by 

R,=U^R^gU.  (1.20) 

The  data  vector  z(fc)  is  then  processed  by  the  reduced-rank  weight  vector  which  is  of  di¬ 
mension  M  X  1.  It  is  the  selection  of  the  rank  reducing  operator  U  which  serves  as  the  present 
topic  of  interest. 

The  most  popular  technique  for  subspace  selection  is  based  upon  the  principal  components 
method  [19,  20,  21].  This  method  determines  the  singular  value  decomposition  of  the  N  x  N- 
dimensional  covariance  matrix  R^^^  and  selects  the  M  largest  eigenvectors  (those  corresponding  to 
the  largest  eigenvalues)  to  form  the  M-dimensional  eigen-subspace  ^  C  in  which  the  adaptive 
processor  operates.  However,  this  technique  does  not  directly  consider  the  MMSE  performance 
measure,  which  is  a  function  of  not  only  the  space  spanned  by  the  noise  covariance  matrix  but 
also  of  the  cross-correlation  between  the  desired  signal  and  the  noise  process.  It  is  noted  that 
Byerly  [22]  discovered  that  the  eigenvectors  corresponding  to  the  largest  eigenvalues  were  not 
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necessarily  the  best  selection,  but  there  was  no  derivation  provided  and  the  approach  obtained 
herein  provides  a  more  general  solution. 

Following  Scharf  [3,  ch.  8],  we  examine  the  problem  of  reducing  the  rank  of  the  Wiener  filter. 
In  Figure  1.4,  the  rank  reducing  N  xM  transformation  matrix  U  is  composed  of  some  M  columns 
from  U  which  are  to  be  selected.  The  operator  U  is  constrained  therefore  to  be  a  subset  of  M  of 
the  N  possible  eigenvectors  of  R^,^.  This  particular  constraint  allows  a  direct  comparison  with 
the  principal  component  technique,  which  chooses  the  rank  reducing  transform  to  be  composed  of 
those  M  eigenvectors  corresponding  to  the  largest  M  eigenvalues.  Thus,  the  particular  problem 
at  hand  is  to  choose  the  subspace  spanned  by  a  set  of  M  eigenvectors  out  of  the  N  available  such 
that  the  resulting  M-dimensional  Wiener  filter  yields  the  lowest  MMSE  out  of  all  (j^)  possible 
combinations  of  eigenvectors. 

Now  denote  the  reduced-rank  processor  output  by  yr(k)  and  the  reduced-rank  noise  estimator 
by  S/srW-  This  is  illustrated  in  Figure  1.4.  A  study  of  Figures  1.3  and  1.4  suggest  that  the 
reduced-rank  processor  output  may  be  expressed  as 

y,(fc)=[l  .  (1.21) 

Denote  the  weight  error  vector  between  the  full-rank  weight  vector  and  its  reduced-rank  version 
by 

e  =  Uwj^  —  UwM-  (1.22) 

The  mean-square  value  of  the  reduced-rank  processor  output  yr  (k)  in  this  notation  is  computed 
to  be 

E[\yr{k)\^]  =  P  +  e^R,,e,  (1.23) 

where  P  is  the  full-rank  MMSE  and  P  +  is  the  reduced-rank  MMSE.  It  is  now  desired 

to  choose  the  rank  reducing  operator  U  in  such  a  manner  that  Wm  minimizes  the  additional 
mean-square  error  incurred  by  rank  reduction,  namely  the  scalar  term  e^R^j^e  in  (1.23),  as 
follows: 

min  l^e^Rx^e  =  min  [(w?Ry^  -  )"]  .  (1.24) 

Then  evidently  the  best  solution  for  (1.24)  is  to  choose  U  such  that  w^U^rH'^  =  w^R^^  is 
the  best  low  rank  approximation  to  the  vector  w^R^^. 

The  Wiener-Hopf  relationship  for  the  full-rank  case, 

RpWw  =  Tpd,  (1.25) 
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Thus,  in  order  to  make  the  vector  be  the  best  low  rank  approximation  to  the  vector 

w^Ry^  it  is  necessary  to  rank  order  the  terms  in  (1.26)  by  their  magnitude.  Then  the  rank 
reducing  operator  U  is  selected  by  choosing  those  M  eigenvectors  which  correspond  with  the 
largest  M  values  of  the  sequence  of  non-negative  terms 


-y/Ai 


(1.27) 


for  i  =  1,2,.  ..N.  With  this  selection,  the  columns  of  the  reduced-rank  covariance  matrix  R^ 
span  the  M-dimensional  cross-spectral  subspace  17  C  to  provide  the  lowest  MMSE  of  any 
M-dimensional  subspace  which  is  spanned  by  M  of  the  N  columns  of  U.  It  is  noted  that  this 
solution  is  similar  to  the  SVD  technique  described  in  [3,  ch.  8.4].  Also  it  is  of  interest  physically 
that  the  term  in  (1.27)  measures  the  cross-spectral  energy  projected  along  the  i-th  eigenvector. 

Hence  the  reduced-rank  Wiener  filter  in  the  subspace  17  is  given  by 


(1.28) 


where  Am  is  the  diagonal  matrix  composed  of  the  M  eigenvalues  corresponding  to  the  eigenvectors 
which  form  U.  Clearly,  the  subspace  17  spanned  by  the  columns  of  eigenvectors  corresponding 
with  the  M  largest  values  of  the  cross-spectral  metric  is  not  the  same  as  the  subspace  iZ'  spanned 
by  the  eigenvectors  corresponding  with  the  M  largest  eigenvalues.  This  means  that  the  Wiener 
filter  in  the  cross-spectral  subspace  17  yields  a  lower  MMSE  than  the  Wiener  filter  in  the  subspace 

To  demonstrate  that  the  cross-spectral  metric  is  optimal  for  each  rank  M  <  N ,  consider  the 
decomposition  of  the  MMSE  performed  by  the  full-rank  matrix  of  eigenvectors  U.  The  full-rank 
MMSE  is  given  by 


— ■ 

i=l 


(1.29) 


Finally,  a  comparison  of  Equations  (1.27)  and  (1.29)  demonstrate  that  the  selection  of  the  sub¬ 
space  which  provides  the  largest  cross-spectral  contribution  also  results  in  the  lowest  MMSE  as  a 
function  of  the  rank  of  the  Wiener  filter.  Since  minimizing  the  mean-square  error  and  maximizing 
the  output  SINK  are  equivalent  for  the  constrained  array,  the  reduced-rank  Wiener  filter  found 
via  the  cross-spectral  metric  maximizes  the  output  SINK  as  a  function  of  the  rank  of  the  filter. 


1.1.4  Example  Using  the  Mountaintop  Radar  Parameters 

A  simulation  is  now  considered  which  utilizes  the  parameters  of  the  Mountaintop  radar  to  com¬ 
pare  the  performance  of  the  reduced-rank  processors  to  that  of  the  full-rank,  joint  domain  op¬ 
timal  processor.  In  particular,  the  output  SINK  for  the  full-rank  processor,  the  eigen-subspace 
reduced-rank  processor  and  two  cross-spectral  subspace  reduced-rank  processors  will  be  com¬ 
puted.  Finally,  the  output  SINK  as  a  function  of  Wiener  filter  rank  will  be  presented  for  these 
processors  and  contrasted  to  that  resulting  from  the  standard  technique  of  simply  choosing  the 
largest  principal  components  of  the  noise  covariance  matrix. 
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The  second  cross-spectral  subspace  processor,  mentioned  in  the  preceeding  paragraph,  utilizes 
the  DCT  in  place  of  the  eigenvectors  of  the  array  covariance  matrix.  This  new  frequency  domain 
processor  has  an  extremely  low  computational  complexity.  The  trade-off  is  a  minor  loss  in 
performance,  to  be  demonstrated  shortly. 

A.  Description  of  the  Mountaintop  Radar  and  the  Signal  Environment 

The  Mountaintop  radar  employs  the  Radar  Surveillance  Technology  Experimental  Radar  (RSTER) 
and  the  Inverse  Displaced  Phase  Center  Array  (IDPCA)  co-located  at  the  same  site  [23].  It  is 
assumed  that  the  radar  is  in  the  RSTER-90  configuration  and  receive-only  mode.  The  transmit 
frequency  is  450  MHz.  This  radar  consists  of  A'=14  elements  and  J=16  pulses  in  the  PRI.  The 
elevation  angle  9  is  fixed  (pre-beamformed),  and  the  azimuth  angle  (f)  is  the  only  free  parameter. 

The  processing  is  performed  over  one  range  gate  of  interest,  where  it  is  assumed  that  this 
range  gate  contains  the  target  signal.  Thus,  the  number  of  inputs  to  the  adaptive  processor  is 
KJ  =  224.  The  GSC  processor  beamforms  the  target  signal  in  one  branch  and  blocks  the  target 
signal  from  the  other  branch,  as  described  earlier  in  this  section.  The  noise  subspace  data  vector 
is  then  of  dimension  N  =  KJ  -  I  =  223.  The  corresponding  full-rank  GSC  weight  vector  is  of 
dimension  223  X  1. 

For  the  purpose  of  this  analysis,  three  barrage  jammers  and  clutter  compose  the  interference 
environment,  and  one  target  signal  is  present.  The  radar  platform  is  assumed  to  be  at  a  height 
of  500  meters,  and  the  platform  velocity  is  500  meters  per  second.  The  target,  which  is  also  at  a 
height  of  500  meters,  has  a  range  of  22  kilometers  and  an  azimuth  of  -30°.  The  target  velocity 
is  250  meters  per  second.  This  target  has  a  signal-to-noise  ratio  (SNR)  of  0  dB,  represents  a  very 
small  constant  radar  cross-section  target.  The  three  jammers  each  have  an  SNR  of  40  dB  and 
azimuth  angles  of  -60°,  30°  and  60°,  respectively.  The  clutter-to- noise  ratio  is  45.6  dB. 

B.  Analysis  and  Results 

We  now  examine  the  performance  of  the  fully  adaptive  GSC,  the  partially  adaptive  eigen-subspace 
GSC,  and  the  partially  adaptive  cross-spectral  subspace  GSC  processors.  To  evaluate  the  per¬ 
formance  of  the  subspace  selection  techniques,  the  dimension  of  the  adaptive  processor  will  be 
reduced  from  N  =  223  weights  to  M  =  50.  The  dimension  of  the  noise  subspace  eigenstructure 
is  77,  which  is  supposedly  the  lower  bound  for  rank  reduction  [19,  20]. 

The  power  spectrum  of  the  data,  averaged  over  500  snapshots,  is  presented  in  Figure  1.5, 
from  which  it  is  evident  that  the  target  signal  is  not  discernible.  Figure  1.6  presents  the  full  rank 
Wiener  filter  power  spectrum,  from  which  it  is  seen  that  the  optimal  space-time  weight  vector  of 
dimension  223  is  capable  of  attenuating  all  jammers  and  clutter  while  passing  the  target  signal. 
The  full-rank  Wiener  filter  SINR  is  23.28  dB.  Figure  1.7  depicts  the  rank  50  eigen-subspace 
Wiener  filter.  The  clutter  is  attenuated,  but  most  of  the  power  from  the  three  jammers  is  passed 
with  relatively  high  gain,  particularly  the  jammer  at  an  azimuth  angle  of  30°.  The  output  SINR 
for  the  eigen-subspace  Wiener  filter  is  7.65  dB,  reflecting  a  loss  of  15.63  dB.  The  power  spectrum 
of  the  rank  50  cross-spectral  subspace  Wiener  filter  is  depicted  in  Figure  1.8.  It  can  be  seen 
that  by  simply  selecting  a  different  set  of  50  eigenvectors,  the  resulting  power  spectrum  closely 
approximates  the  full  rank  optimal  and  all  clutter  and  jammers  are  attenuated.  The  resulting 
output  SINR  is  23.20  dB,  reflecting  a  loss  of  only  0.08  dB  in  reducing  the  rank  from  223  to  50. 
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Normalized  Doppler  Normalized  Doppler 


Figure  1.7:  The  reduced-rank  eigen-subspace  Wiener  filter  power  spectrum 


Sin(azimuth) 

Figure  1.8:  The  reduced  rank  cross-spectral  subspace  Wiener  filter  power  spectrum 
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Figure  1.9:  The  output  SINR  of  the  reduced-rank  processors  cls  a  function  of  the  rank  of  the  data 


The  performance  of  these  two  reduced-rank  processors  may  be  evaluated  by  plotting  the 
output  SINR  as  a  function  of  Wiener  filter  rank.  In  addition,  we  are  interested  in  comparing  two 
other  reduced-rank  processors. 

The  first  additional  reduced-rank  processor  under  consideration  is  provided  via  the  PCI  tech¬ 
nique.  This  method  estimates  the  full  rank  noise  covariance  matrix  =  Ri  +  Rc  +  by  the 
largest  principal  components  to  form  Ry^.  For  the  purpose  of  this  analysis,  the  corresponding 
eigenvalues  are  retained  in  the  estimate  as  opposed  to  making  the  common  approximation  that 
they  have  a  value  of  unity.  The  resulting  equivalent  direct  form  array  has  a  constrained  Wiener 


filter  w  given  by 


w  = 


(1.30) 


and  the  array  output  is  yr{k)  =  w^x(fc).  Note  that  the  weight  vector  is  full-rank. 

The  second  additional  reduced-rank  processor  under  consideration  uses  the  GSC  form  array 
with  the  DCT  unitary  operator  for  subspace  selection.  The  use  of  the  DCT  matrix  Q  in  place 
of  the  matrix  of  eigenvectors  U  in  Figure  1.3  permits  the  use  of  an  approximation  to  the  cross- 
spectral  metric.  Thus,  the  rank  reducing  transformation  U  in  Figure  1.4  is  composed  of  some  M 
columns  of  the  DCT  matrix  Q  to  be  selected  by  a  modified  version  of  the  cross-spectral  metric. 
This  modified  metric  replaces  the  numerator  of  (1.27)  with  the  mean-square  value  of  the  product 
formed  by  the  i-th  DCT  output  and  the  beamformed  signal  d{k).  The  denominator  is  replaced 
with  the  power  of  the  data  in  the  i-th  DCT  frequency  bin. 
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Figure  1.9  depicts  the  output  SINK  of  these  four  reduced-rank  processors  as  a  function  of 
the  rank  of  the  data.  For  the  three  GSC  processors,  the  rank  of  the  Wiener  filter  and  the  rank 
of  the  data  are  identical.  Thus,  this  plot  depicts  the  SINK  as  a  function  of  the  Wiener  filter 
rank  in  these  cases,  whereas  the  curve  corresponding  to  the  PCI  method  (utilizing  the  low-rank 
covariance  estimate)  requires  a  rank  224  weight  vector  for  each  value  of  the  rank  along  the  x- 
axis.  It  may  be  seen  from  Figure  1.9  that  the  cross-spectral  metric  with  the  eigenvector  basis 
performs  the  best,  obtaining  the  optimal  SINK  at  a  Wiener  filter  rank  of  approximately  56.  The 
largest  eigenvector  technique  using  the  GSC  requires  a  processor  of  approximately  the  dimension 
of  the  noise  subspace  eigenstructure,  corresponding  to  77  weights,  to  obtain  the  optimal  solution. 
The  cross-spectral  metric  approximation  with  the  DCT  basis  performs  better  than  the  GSC 
eigen-subspace  processor  until  the  rank  of  the  weight  vector  reaches  the  dimension  of  the  noise 
subspace  eigenstructure.  The  PCI-type  processor,  which  forms  a  low-rank  estimate  of  the  noise 
covariance  matrix,  performs  the  worst.  The  PCI  approach  does  not  obtain  the  optimal  solution 
until  the  covariance  estimate  nearly  utilizes  all  of  the  principal  components.  In  addition,  the 
PCI  processor  performs  very  poorly  until  the  number  of  principal  components  included  in  the 
covariance  estimate  reaches  the  rank  of  the  noise  subspace  eigenstructure.  Only  at  that  point 
does  the  PCI  technique  begin  an  exponential  convergence  to  the  full-rank  SINK;  conversely,  both 
cross-spectral  approaches  converge  exponentially  along  their  entire  trajectory. 

1.1.5  Conclusions 

A  cross-spectral  metric  for  subspace  selection  in  partially  adaptive  array  processing  is  derived  and 
a  proof  provided  to  show  that  this  metric  is  the  optimal  performance  measure  to  use  in  deciding 
which  eigenvectors  to  keep  for  rank  reduction.  Also  it  is  demonstrated  that  a  GSC  array,  oper¬ 
ating  in  the  subspace  selected  by  the  cross-spectral  subspace  estimator,  exceeds  the  performance 
realized  by  operation  in  a  subspace  which  is  based  upon  the  principal  components  method.  The 
cross-spectral  subspace  provides  a  better  performance  than  the  eigen-subspace  when  the  con¬ 
ventional  beamformer  pattern  provides  any  attenuation  of  the  interference.  An  example  and 
analysis  are  provided  on  the  assumption  of  exact  signal  knowledge,  although  estimated  statistics 
also  may  be  utilized  in  conjunction  with  an  adaptive  algorithm.  Finally,  the  performance  of  these 
reduced-rank  techniques  are  compared  to  the  PCI  method  and  a  low  computational  complexity, 
DCT  based,  approximation  to  the  cross-spectral  metric.  It  is  demonstrated  that  the  GSC  form 
array  methods  outperform  the  PCI  technique  and  result  in  reduced-rank  weight  vectors,  whereas 
the  PCI  method  always  requires  a  full-rank  filter. 


1.2  A  Simultaneous  CFAR  Detection  and  ML  Estimation  STAP 
Algorithm  for  Airborne  Radar 

The  algorithm  in  this  section  provides  both  a  CFAR  detection  and  a  maximum  likelihood  (ML) 
Doppler-bearing  estimator  of  a  target  in  a  background  of  unknown  Gaussian  noise.  A  target  is 
detected,  and  its  parameters  estimated  within  each  range  gate  by  evaluating  a  statistical  test  for 
each  Doppler-angle  cell  and  by  selecting  the  cell  with  maximum  output  and  finally  comparing  it 
with  a  threshold.  The  CFAR  performance  of  the  proposed  processor  is  analyzed  by  the  use  of  the 
sample  matrix  inversion  (SMI)  method  and  is  evaluated  in  the  cases  of  a  fully  adaptive  STAP 
and  two  partially  adaptive  STAP’s.  The  performances  of  these  criteria  show  that  the  probability 
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of  detection  is  a  function  only  of  the  sample  size  K  used  to  estimate  the  covariance  matrix  and  a 
generalized  signal-to-noise  ratio.  The  choice  of  the  number  K  is  a  trade-off  between  performance 
and  computational  complexity.  The  performance  curves  demonstrate  that  the  finer  the  resolution 
is,  the  poorer  the  detection  capability.  This  means  that  one  can  trade  off  the  accuracy  of  ML 
estimation  with  the  performance  of  the  CFAR  detection  criterion. 

1.2.1  Introduction 

A  requirement  of  future  airborne  radar  is  to  detect  targets  in  an  interference  background  which 
is  comprised  of  both  clutter  and  jamming.  Conventional  radars  are  capable  of  detecting  targets 
in  the  time  domain  (range)  and  the  frequency  domain  (Doppler  frequency).  Space-time  adaptive 
processing  (STAP)  radars  provide  an  additional  domain  (space)  for  the  detection  of  targets. 
STAP  is  a  two-dimensional  adaptive  filtering  algorithm  which  uses  both  temporal  and  spatial 
filtering  to  suppress  interference. 

In  the  receive  mode,  radar  target  echoes  are  received  usually  within  a  transmit  beamwidth. 
But,  depending  on  the  array  aperture  and,  more  importantly,  on  the  target  range,  the  angular 
sector  encompassed  by  the  mainlobe  of  the  receive  beam  may  correspond  to  a  fairly  large  swath 
of  space.  Thus,  many  measurements  and  an  interpolation  are  needed  to  more  precisely  determine 
the  bearing  of  the  target. 

Traditionally,  the  process  of  target  detection  and  estimation  is  carried  out  in  sequence  by  dif¬ 
ferent  operational  techniques.  However,  in  this  section  STAP  target  detection  and  ML  parameter 
estimation  are  obtained  simultaneously,  except  for  a  final  interpolation,  with  a  single  criterion. 
First,  the  fully  adaptive  test  (defined  in  [18]  as  fully  adaptive  if  the  process  refers  to  both  the 
spatial  and  temporal  domain)  is  based  on  the  assumption  of  known  signal  parameters,  range, 
Doppler,  the  elevation  and  azimuth  angles  and  a  known  noise  covariance  matrix.  Then,  this  test 
is  normalized  so  that  it  has  a  unit  variance.  Finally,  the  test  is  extended  to  find  the  maximum 
response  of  the  Doppler-angle  filter  bank.  The  system  block  diagram  of  this  fully  adaptive  STAP 
is  shown  in  Figure  1.10.  To  simplify  the  analysis,  the  responses  of  the  set  of  Doppler-angle  filters 
are  assumed  to  be  mutually  orthogonal.  Under  such  an  assumption  the  false-alarm  probability 
and  the  probability  of  detection  with  a  known  covariance  matrix  are  found  analytically  in  this 
section. 

Also,  the  statistical  performance  of  this  detector  is  extended  and  analyzed  by  replacing  the 
noise  covariance  matrix  by  a  maximum  likelihood  (ML)  estimate  of  the  noise  covariance  which 
is  based  on  actual  data.  It  is  demonstrated  here  that  this  makes  the  above  test  into  a  powerful 
CFAR  detection  test  which  simultaneously  estimates  both  the  Doppler  frequency  and  bearing 
angle  of  the  target. 

However,  the  computational  complexity  for  the  required  number  of  degrees  of  freedom  makes 
the  above  fully  adaptive  STAP  difficult  to  realize  in  practice  [18].  Two  classes  of  reduced- 
dimension  estimator-detectors,  which  are  called  partially  adaptive  STAP,  are  also  developed  and 
analyzed  in  this  section.  A  partially  adaptive  STAP  is  defined  to  be  a  processor  which  is  not 
fully  adaptive  simultaneously  in  both  the  spatial  and  temporal  domain.  The  partially  adaptive 
element-space  post-Doppler  STAP  is  shown  in  Figure  1.11,  and  the  partially  adaptive  beam- 
space  pre-Doppler  STAP  is  shown  in  Figure  1.12.  Finally,  these  two  suboptimal  STAP  criteria 
are  compared  with  the  above-described  fully  adaptive  STAP. 
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1.2.2  Formulation  of  the  Problem 

The  system  under  consideration  is  a  coherent  pulsed  Doppler  radar.  For  simplicity  the  radar 
antenna  is  assumed  to  be  a  uniformly  spaced  linear  array  antenna  consisting  of  N  elements. 
The  radar  transmits  a  coherent  burst  of  M  pulses.  For  each  pulse  repetition  interval  (PRI)  L 
range  samples  are  collected.  With  M  pulses  and  N  receiver  channels,  the  received  data  for  one 
coherent-processing  interval  (CPI)  comprises  LMN  complex  baseband  samples,  called  the  LMN 
data  cube. 

For  the  present  it  is  assumed  that  a  target  remains  in  one  range  gate  during  a  CPI.  The 
snapshot  of  the  MN  samples  of  data  for  the  range  gate,  which  contains  target  signal,  is  the  so- 
called  primary  data.  The  secondary  data  consists  of  the  outputs  of  K  range  gates  in  the  vicinity 
of  the  range  gate  of  a  target  to  be  detected.  This  set  of  K  range  gates  consists  of  a  subset  of  the 
remaining  (L-1)  range  gates. 

The  noise  components  of  the  data  vectors  consist  of  thermal  noise,  clutter  and  jamming  which 
are  modeled  as  zero-mean  complex  Gaussian  random  vectors.  The  noise  components  of  each  range 
gate  are  assumed  to  share  a  common  covariance  matrix  and  to  be  mutually  independent. 

For  the  analysis  it  is  desirable  to  assume  that  each  filter  is  statistically  independent.  This  is 
achieved  approximately  by  the  proper  spacing  between  adjacent  filters  and  beams. 

1.2.3  Derivation  of  the  Test  Statistic  (when  Noise  Covariance  is  Known) 

Let  X  be  a  MNxl  snapshot  of  a  given  range  gate,  i.e.  the  observation  vector, 

X  =  [a:ii,  a;i2)  ■ '  ^iN,  2^21)  3:22,  •  •  •  >  X2Nt' ' '  1  •  •  •  5  xmn]'^  )  (1-31) 

where  Xij  represents  the  sample  from  the  pulse  and  the  antenna  element. 

Two  hypotheses  are  postulated  : 

Hq:  X  =  n  ,  (1.32) 

Hi:  X  =  s -f  n  ,  (1.33) 

where  n  is  a  MN  X 1  Gaussian  noise  vector  with  known  covariance  matrix  R  and  s  is  a  known 

MNxl  target  vector  with  phase  angle  Doppler  frequency  Um  and  a  complex  amplitude  a 

with  random  phase  <f>.  Thus  s  can  be  represented  by  the  MN  dimensional  vector, 

s  =  ah{um)  ®  a('!?„)  =  a:v(w„,  #„)  ,  (1-34) 

where  b(a;TO)  is  a  Mx  1  temporal  steering  vector,  a(??„)  is  a  N X 1  spatial  steering  vector,  ®  denotes 
Kronecker  product  and  v(a;„,  -!?„)  =  b(a;„)  a(i9„)  is  a  MNx  1  steering  vector. 

Since  the  starting  phase  ^  is  a  random  variable,  the  random  vector  x  is  not  strictly  Gaussian. 
However,  when  conditioned  on  (f>i  x  is  complex  Gaussian  under  both  hypotheses  Hg  and  Hi. 
Thus  the  conditional  probability  densities  of  x  are 

Pi^\Hi,<f>) 

F(x|Ho) 


exp{-[x  -  a'vjoJm,  ^n)]^R~^[x  -  av(^m,  ^n)]} 


TT 


^IRI 


exp{-x"R  ^x) 


(1.35) 

(1.36) 
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where  a  =  |a;|e-^'**’  with  the  starting  phase  which  is  uniformly  distributed  between  [0,27r],  and 
R  =  ^{nn^}. 

The  likelihood  ratio  test  A^n  of  the  densities  in  (1.35)  and  (1.36)  is 
_  P(x|gi) 

P(x|P,)  ~  P(x|P,) 

r2Tv  .  j.  1 

=  exp|2Pe  a*v^(wm,i9„)R“^x  -  |ttpv^(a;TO,  ??„)R“V(u;m,  ??„)  j  — 

Hi 

=  exp|-|Q;pv^(a;^,7?„)R“^v(a;m,T?„)|7o  (2|q;|  |v'^(w„,i?„)R"^x|)  ^  To  (1.37) 

Ho 

where  Io{x)  is  the  modified  Bessel  function  of  the  first  kind.  Since  a  covariance  matrix  is  non¬ 
negative  definite,  i.e.  {um,  '&n)H~^v{um,  '&n)  >  0,  the  exponential  term  on  the  left  hand  side 
of  the  above  equation  can  be  absorbed  into  the  threshold.  Hence  the  above  test  becomes 

Hi 

/o  (2|q;|  |v^(Wm,  d„)R“^x  j  <  Toexp^\a\'^v^{oJm,'On)H~^v{um,'&n)]  =  Ti  .  (1.38) 

Ho 

Since  Io{x)  is  a  monotonically  increasing  function  of  positive  x,  this  test  further  reduces  to 
the  following  test  : 


(1.39) 


Finally,  a  normalization  with  respect  to  the  output  noise  power,  yields  the  following  optimum 
group  invariant  test  with  threshold  Tmn  for  ^  change  of  scale  and  starting  phase  angle  (f>  (see 
[26]): 


v^(a;„,d„)R  ^x 


v^(a;TO,'!?n)R  ^v(wm,i9„)  ^  v^(a;„,  ??„)R-iv(a;m,  ??„) 


(1.40) 


It  is  assumed  for  the  present  that  R  is  known  except  for  a  multiplicative  scale  factor.  The  test 
in  (1.40)  can  be  viewed  in  another  way.  Now  let 

R“^/^v(u;„,7?„)  =  v(w„,  ■!?„),  ,,  ^  =  v(a;^,dn)  and  R'^/^x  =  x  (1.41) 


A  substitution  of  the  above  transformations  into  (1.40)  the  test  rmn  becomes 


'^mn  —  V  (^mj"^7i)x  Tyt 

Ho 


(1.42) 


Thus,  the  test  is  equivalent  to  a  measure  of  the  projection  of  the  whitened  observation  x  onto 
the  unit  ‘‘whitened”  steering  vector  v(a;727,  i?7i). 

The  conditional  pdf  of  the  test  Tmn  under  Hi  can  be  derived  easily  by  the  following  theorem 
(see  [29]  and  [37]  ): 
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Theorem  1.1  //  x  is  Im)  and  B  is  an  m  x  m  projection  matrix  of  rank  k  then  x^Bx 

has  a  noncentral  distribution  where  S  =  m^Bm^;. 

In  (1.42)  X  is  Nmn  (q;R"^/V,  l)  and  v{u>m,^n)'v^ is  a  projection  matrix  of  rank  one. 
Then,  by  Theorem  1.1  tfie  test  r^n  is  equivalent  to  XiiPmn),  where  pmn  is  the  generalized  output 
signal-to-noise  ratio  of  the  detector  given  by 

Pmn  =  .  (1-43) 

Hence  the  conditional  pdf  of  under  Hi  is  given  by 

P{rmn\Hl)  =  {2y/Pmnrmn)U{rmn)  ,  (1-44) 

where  U{-)  represents  the  unit  step  function.  By  setting  pmn  equal  to  zero  in  (1.44),  the  condi- 
tional  pdf  of  rmn  under  Hq  is  obtained  as 

P{rmn\Ho)  =e-^”^”Uirmn)  ■  (1-45) 

For  the  situation  of  a  known  steering  vector  and  known  covariance,  the  false-alarm  probability, 
denoted  by  PpAm-ni  the  probability  of  detection,  denoted  by  Pomni  respectively  given  by 

'  ,  (1.46) 

Tmn 

POO  .  fOO  _ 

^Dmn  —  I  P{'^mn\Hi)drmn  =  I  ^  Iq  (2 Pmn ‘^mn) drmn  i  (1.47) 

dTmn  •^Tmn 

where  the  threshold  Tmn  for  a  given  false  alarm  probability,  is  given  by 

Tmn  =  -HPFArnn)  ■  (1-48) 

Actually,  the  target  steering  vector  v{um,'dn)  is  unknown  in  the  real  radar  problem.  For  a 
single  target,  the  standard  technique  is  to  evaluate  the  above  test  statistic  over  a  discrete  set  of 
Doppler-angle  vectors,  i.e.  forming  a  Doppler-angle  filter  bank,  and  declaring  target  presence  if 
the  maximum  filter  output  of  the  filter  bank  exceeds  the  threshold  Th-  Thus,  the  test  r  over  the 
filter  bank  is  given  by 


It  is  shown  in  the  next  section  that  this  maximization  over  the  filter  bank  also  provides  a  ML 
estimate  of  the  target  steering  vector.  That  means  that  the  detection  and  estimation  of  Doppler 
and  angle  are  performed  simultaneously. 

For  the  Doppler-angle  filter  bank  case,  by  the  independence  of  each  Doppler-angle  channel,  the 
false-alarm  probability,  Pfa,  and  the  overall  threshold,  Th  are  related  by  the  following  formulae: 

Pfa  =  Pr{  Max  (r„„)}  >  Th\Ho}  =  1  -  Pr{rn  <  Th,  r^  <  Th, vmn  <  Th\Ho} 

l<m<M 

l<n^A^ 

N  M 

=  1  -  n  n  <  Th\H,}  =  1  -  (1  -  e-’'*)*"  ,  (1.50) 

n—\  m=l 

and 

Th  =  -ln[l  -  {1- Pfa)'^]  ■  (4-54) 


Hi 

Y  ^  Th  .  (1.49) 

Ho 


J  ^X 

max  f  7^777,72,  T  —  max^  xjf  o  i  /  q  \ 

m,n  m,n  {u)m, '&n)^  {u>rn , n) 
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The  probability  of  detection  Pd  for  the  Doppler-angle  filter-bank  is  given  approximately  by 
=  Pr{  Max  (rmn)  >  Th\Hi}  =  1  -  Fr{rii  <  Th,  rn  <  Th, vmn  <  Th\Hi} 

l<m<M 

l<n<N 

MN-1 

=  l-Pr{rmn<Th\Hi}  H  Pr{n<Th\Ho} 

k=l 

=  I  -  T''  (1.52) 

Jo 

To  arrive  at  this  approximate,  but  quite  accurate,  result  for  large  MN,  it  is  assumed  that  the 
target  can  only  appear  in  one  Doppler-angle  channel. 


1.2.4  The  Maximum  Likelihood  Estimator  of  Target  Direction 


Assume  that  the  snapshots  x(i)  from  different  range  gates  i,  for  i=l,2,...,K,  are  independent. 
Then  the  joint  pdf  F(x(l),x(2), . .  .,x(A))  of  the  observed  data  from  different  range  gates  is 
given  by 


P(x(l),...,x(A)) 


np(x(f)) 

j=rl 

/ 1  \  MNK  .  (  ^  „  \ 

j  iRp-^  exp  1^-  X  j 


(1.53) 


Let  X  =  [x(l),x(2),...,x(A)].  Then  this  joint  pdf  F(x(l),  x(2), . . . ,  x(A'))  can  be  rewritten  as 
follows  : 


P(x(l),...,x(A)) 


/  N  MNK 
\nj 


exp  {-trace  (x^R-^x)} 
|R|“^  exp  trace  {^R  XX^^ 


(1.54) 


The  maximum  likelihood  estimate  of  the  covariance  R  is  that  value  of  R  which  achieves  the 
maximum, 

(^i)^^lRr'exp  (^-trace  (^R'^^XX^)) 

=  M2n|log  |R|  -b  trace  {r  ,  (1.55) 

where  S  =  By  taking  the  partial  derivatives  of  (1.55)  with  respect  to  the  components 

of  R  and  setting  the  result  to  zero,  yields 


Max 

R 


d 

dK 


j^log  |R|  +  trace  {r 


=  0 


(1.56) 


or 


(r-i)^-  (r-^SR-^Y  =  0 


(1.57) 
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so  that  finally  the  ML  estimate  is  given  by 

R  =  S, 


(1.58) 


where  ^  denotes  the  partial  derivatives  with  respect  to  the  components  of  R 

Next,  the  maximum  likelihood  (ML)  estimate  of  the  complex  amplitude  a  and  the  steering 
vector  v{oJrn^  '&n)  ^re  derived  by  a  use  of  the  ML  estimate  R  in  (1.58).  The  required  ML  estimate 
of  the  parameters,  a  and  '&n)i  are  defined  by  the  pair  (d,  v(a;m,  '&n))-  This  estimate  is  that 

pair  (a,\{u}m)'&n))  which  attains  the  maximum  in  the  following  maximization  process  : 

Max  P  fx|a:,v(a;TO,  ■!?„),  R)  =  Max  logP  (x|a,  v(wm,'!?„),R) 

a,v(u;m,^n)  ^  '  0(fV({jJrn,Vn)  ^  ^ 

=  Max  |[x  -  Q;v(a;m,??„)]^R“^  [x  -  Q;v(a;m,i9„)]| 

Or,V{a;m,»n)  ^ 

=  Max  0  (a,  v(a;TO,^n))  ,  (1-59) 

Of  ,V  (0/771 

where  ^  v((^77i,  ?9n))  is  the  log-likelihood  function  (excluding  the  constant  terms).  This  maxi¬ 

mization  is  performed  first  over  a,  as  in  [33]  and  [17],  to  obtain  that  value  of  ot  for  which 


or 


dot 


d 

—fif  (a,  v(u;„j,  ??„))  =  0  , 


x^R  ^x  -  ox-^R  ^v(a;m,  ??„)  -  Q;*v^(a;m,  i9„)R  ^x 

+Q;*Q;v^(u;m,  ??„)R“^v(u;m,  ■!?„)]=  0  , 


or 


x^R„^v(w„,  r?„)  -  a*v^(u;„,  •!?„)R  V(u;„,  ??n)  =  0  . 


(1.60) 


(1.61) 


(1.62) 


Hence  one  obtains 

&= _ -  ,  (1.63) 

A  substitution  of  the  above  result  for  d,  the  estimate  in  (1.63)  of  the  amplitude  a,  into  the 
log-likelihood  function  in  (1.59)  yields  finally  the  ML  estimate  which  is  that  values  of 

v(cj,„,'!?„)  which  achieves 


Max 

'V{wm,'8n) 


v^(a;TO,'!?n)R 

^  1 

(1.64) 


1.2.5  Performance  Analysis  of  the  Detector  (when  the  Noise  Covariance  is 
Unknown) 

In  the  real  radar  problem,  the  noise  covariance  is  unknown.  The  best  way  to  find  the  required 
maximum  likelihood  ratio  test  for  detection  is  to  replace  the  noise  covariance  matrix  with  its 


20 


maximum  likelihood  estimate  using  the  observed  data  for  a  set  of  different  range  gates  in  the 
vicinity  of  the  range  gate  of  the  target  to  be  detected.  The  test  statistic  under  the  signal-plus- 
noise  hypothesis  Hi  is  derived  first.  If  R  is  replaced  by  R  in  (1.40)  and  ^(XX-^)  is  substituted 
for  R,  the  test  Cmn  becomes 


^  mn 


(1.65) 


It  is  assumed  that  the  observed  data  from  the  different  range  gates  are  independent  and  have  the 
same  statistics.  Then  a  whitened  X  is  obtained  as  follows  : 


X  =  [x(1),x(2),...,x(K)]=[R1/V(1),R1/2w(2),...,R1/V(A')]  =  R'/'W  ,  (1.66) 


where  W  =  [w(l),  w(2), ...,  w(A')]  is  an  MN  x  K  matrix  comprised  of  independently  identically 
distributed  (i.i.d.)  complex  Gaussian  random  variables  with  a  zero  mean  and  a  unit  variance, 
each  random  variable  [w(/)]i,  for  i=l,...,MN,  /=!,... ,K,  is  distributed  N(0,1).  By  substituting 
r1/2w  for  X,  the  test  rmn  now  becomes 

_  |v^(^m,  t?„)R-i/^(WW")-iR-^/^xp  ^  ^  , 

v^(w,„,i?„)R-i/2(WW^)-iR-i/2a(7?,)  K  ^  ’ 

Let  Q  be  an  arbitrary  unitary  and  Hermitian  matrix  and  w  be  a  white  Gaussian  random  vector 
with  Cov(w)  ^  ImN'  The  mean  and  covariance  of  Qw  is  the  same  as  that  of  w.  Thus,  Qw 
and  w  are  equivalent  statistically  since  they  are  both  Gaussian  and  have  the  same  second  order 
statistics.  It  follows  that 


QW=  W  and  WW^  =  QWW^Q  , 


(1.68) 


where  ’  denotes  equality  in  distribution.  Then  r^nn  in  (1-67)  becomes 


|v^(c^m,^n)R~^/^Q(WW^)-lQR-l/^Xp 


v^^(a;„,r9„)R-i/2Q(WW^)-iQR-i/2v(a;„,i9„)  ^ 


<C  ^  mn 
o 


(1.69) 


By  the  Householder  transformation  (see[35],[34]  and  Appendix  A)  one  can  find  next  a  unitary 
matrix  Q  such  that 


[f^m  5 


J^i 


(1.70) 


where  ei  =  [1,0,  ...,o]^  is  a  unit  vector,  0i  =  n  +  [z]i  —  first  coordinate 

component  of  z,  and  arg{x  +  jy)  =  tan~^{y/x).  By  the  above  result  one  obtains  the  following 
equivalent  test  : 


|^v^(a;„,r?„)Rlli'a(T?,)ef(WW^)-iQR-ix|2  Hi 


(1.71) 
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|ef(WW^)-iQR-i/2x|2  , 


^  rp 

<i  mn  ? 


(1.72) 


where  [(WW^)  ^]i^i  represents  the  (1,1)  scalar  element  of  the  matrix  (WW^)  In  the  above 
equation,  QR-i/^x 

can  be  simplified  in  the  following  manner: 

QR“^/^X  =  QR“^/^  {av{LOrn,  ^n)  +  n) 

=  exp  |j  [n  +  arg  ([R"^/^v(a;m,  i?n)]i))}  i?„)R-iv(a;TO,  ^?n)ei  +  Qw 

=  u  (1.73) 


where  w  is  a  MN  x  1  white  Gaussian  random  vector  and  u  is  a  MN  x  1  vector  with  inde¬ 
pendent  entries,  N(0,1),  except  that  the  first  component,  lii,  which  is  N(m^^^,l)  with  ~ 

yjpmn  6xp  (n  +  arg  A  substitution  of  (1.73)  into  (1.72)  yields  the  test, 


|ef  (WW^)-iu|2 


■'-d  [(WW^)-i]i,i  Ho  "  ' 

Before  deriving  the  pdf  of  the  one-filter  test  rmn  in  (1-74),  the  pdf  of  [(WW'^)“^]i4  is  derived 
first.  Let  u  be  a  MN  X  1  vector  which  is  partitioned  as  follows: 


(1.74) 


Ui 

U  =  / 


(1.75) 


where  ui  is  a  scalar  and  is  a  (MN-1)  X  1  vector.  Next,  W  is  partitioned  in  the  manner, 


2-1 


(1.76) 


where  is  a  1  X  K  vector  and  is  a  (MN-1)  x  K  matrix.  From  (1.76)  one  obtains  the 
following  partition  for  WW^: 


Si 


All  Ai2 
A21  A22 


(1.77) 


From  [37],  the  inverse  (WW'^)  ^  of  the  partitioned  matrix  WW'^  is  given  by  the  following 
formula; 


(ww")-^  = 


^  [(WW^)-i]^^  [(WW^)-i]^ 


(WW^)-i]^^  [(ww^)- 


Bii  Bi2 
B21  B22 


(1.78) 


where 


Bii  = 

^Aii  -  Ai2A22^A2lj 

B21  = 

"■A22^A21  (Aii  -  A12A, 

Bi2  = 

-  (Aii  -  Ai2A22^A2i) 

'2  A21)  ) 


B22  =  A22^A21  ^Aii  -  Ai2A22^A2i)  Ai2A22^  +  A22^ 


(1.79) 
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Then,  by  the  above  formula  7]  = 


WW 


-1 


1,1 


is  obtained  as 


=  (zfzi-zfZ  (Z^Z)"'Z% 


-1 


^  Xk-N+1 


(1.80) 


(see  Appendix  B),  where  xh  is  a  complex  normalized  chi-squared  random  variable  with  m  degrees 
of  freedom.  A  substitution  of  (1.79)  into  (1.78)  and  then  along  with  (1.80)  into  (1.74),  yields  the 
following  result  for  r^n  : 


T  mn.  — 


Ui 

’(WW)"^ 

+ 

Jl,l 

’(WW)-^' 

2 

/ 

U 

1,2 

1 

(WW)"^’ 

1,1 

=  V 


-zfZ(Z^Z)"^u 


(1.81) 


Since  the  summation  of  the  independent  Gaussian  random  variables  is  also  a  Gaussian  random 
variable,  one  obtains 


rii  -  zf  Z(Z^Z)-'u'  =  iV(m'^_,l)-(zfZ(Z^Z)-2z%)'/2^(0,l) 


(l  +  zfZ(Z^Z)"^Z^zi) 


1/2  , 
u 


(1.82) 


where  u  is  1).  Thus  r^n  in  (1-81)  becomes 


d  ; 


(1.83) 


Here  C  is  equivalent  to  a  beta  distributed  variate  (see  Appendix  B)  defined  as  follows  : 


C  =  (i  +  zfz(z"z)"2z%) 

The  pdf  of  this  beta  distribution  is  given  by 

1 


-1 


(1.84) 


/"(C)  = 


rC 


K-MN+l 


B{MN  -1,K  -  MN  +  2) 
where  B(M,N)  is  the  beta  function  defined  by 

(M-  1)!(A-  1)! 


(1-0 


MN-2 


(1.85) 


B{M,N)  = 


(M  +  iV-  1)! 


(1.86) 


Since  the  magnitude  squared  of  a  nonzero  mean,  say  m,  Gaussian  random  variable  is  equivalent 

to  a  non-central  x^{m^m)  variate,  one  has  |up  =  Xi(Pmn)-  This  test  r^n  in  (1.83)  finally 

d 

becomes 


_  Xl  jPmn) 

^  Xlc-MN+I^ 


(1.87) 


Since  {n2XniiPmn)) / {niXni)  ^  noncentral  F  distribution  (see[31],  [29]  and  [37]),  the  conditional 


pdf  of  rmn,  under  C  and  Hi,  is  given  by 

g“Pmn 
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where  iFi{a;b;x)  is  the  confluent  hypergeometric  function  deflned  by 

in  (  h  ^  o(q  +  1)  •  •  •  (q  +  ~  1)  3:^ 

6(6+l)---(6  +  ib-l)  k\' 

Averaging  P{t'mn\C^  -^i)  over  one  has  the  conditional  pdf  of  under  H\i  as  follows. 


(1.89) 


j. 

P{rmn\Hi)  =  5(1^  K  -  MN  +  1)J5(MA  -  1,  A  -  MN  +  2)  ^ 


rl  ^—pmn^K—MN-\-2 


Setting  pmn  =  0  in  the  above  equation,  the  conditional  pdf  of  under  Hq,  is  given  by 

P{rmn\Ho)  =  5(1^  K  -MN  +  1)B{MN  -  1,  K-MN  +  2)  Jo  +  d(llM) 

From  (1.46)  and  (1.91)  the  false-alarm  probability  PpAmn  ^  single  Doppler-angle  Alter  is 
given  finally  by 

p _ ^ _ X 

^FAmn  -  b{1,K-MN  +  1)B{MN  -l,K-MN  +  2) 

=  - . i  (l-C)"''-“<iC  .(1-92) 


MN-2 


PmnC^r 


POO  rl 

JT'mn  Jo  (1 

Jo  ( 1  +  T'C 


~  B{MN-l,K-MN  +  2)Jo  \l  +  T;^r.CJ 

From  (1.50)  and  (1.91)  the  false-alarm  probability  of  the  Doppler-angle  filter  bank,  Pfa,  is  given 


V5(l,  K  -  MN  +  l)B{MN  -  1,  K  -  MN  +  2)^ 


/  f 

Jo  Jo 


(1  + 


(1.93) 


Note  that  the  above  expressions  for  PpAmn  ^fa  are  functions  only  of  K(  the  number  of 
range  gates  used  to  estimate  the  covariance  ),  N  ( the  number  of  antenna  elements  ),  and  M  (the 
number  of  pulses  in  CPI).  Hence  these  false-alarm  probabilities  are  independent  of  the  signal  and 
noise  power.  As  a  consequence,  the  test  in  (1.65)  is  a  CFAR  criterion. 

The  probability  of  detection  for  the  single  Doppler-angle  channel  case  is  obtained  from  (1.47) 
and  (1.90  as  follows  : 


5(1,  K  -  MN  -b  1)B{MN  -1,K-  MN  -|-  2) 


III 


1  ^-Pmn(^K-MN+2^-j^  _  Q 


MN-2 


(1  +  rmnC) 


K-MN+2 


iFi(A'  -MN  +  2, 1, 

1  +  ran 


^"“Pmn 

B{MN  -1,K  -  MN  +  2)  ^ 

rl  °° 

/  ^K-MNAl  _  qMN-2  Pmn  j 


k\ 


{k  +  l,K-MN+l)dC,  (1.94) 


l  +  C^mn 
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where  Ix{(i,b)  is  an  incomplete  beta  function. 

Similarly,  the  probability  of  detection  for  the  Doppler-angle  filter  matrix  case  can  be  obtained 
finally  from  (1.52), (1.90)  and  (1.91)  as  follows: 


Fd  =  1- 


r/‘ 


B{1,  K  -MN  +  1)B{MN  -  1,  K  -MN  +  2) 

-Pmn(^K-MN+2^l  _  qMN-2 


(l  +  r^nC)^'-'^'^+2 


-iFi(A'-MiV  +  2,l, 


Pmn^'^r 


F(l,  K  -MN+  l)F(MiV  -  1,  K  -MN  +  2) 
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The  probabibity  of  detection  as  a  function  of  the  generalized  signal-to-noise  ratio,  pmn,  is 
shown  in  Figures  1.13  and  1.14  for  M=N=10  and  M=N=5,  respectively.  This  leads  to  the  two 
kinds  of  curves,  plotted  on  the  same  figures,  one  for  a  single  channel  and  another  for  a  filter 
bank.  With  the  same  parameters  (M,  N  and  K),  the  performance  of  the  single  channel  is  better 
than  that  of  the  filter  bank.  From  Figures  1.13  and  1.14,  one  can  find  the  effect  of  increasing  the 
numbers  M  and  N  in  the  filter  bank  on  the  performance.  As  the  number  of  filters  in  the  filter 
bank  becomes  larger,  the  performance  of  the  filter  bank  is  much  degraded  from  that  of  the  single 
channel.  The  effect  of  sample  size  K  on  the  performance  is  not  very  significant  for  M=N=10  in 
Figure  1.13,  but  about  2  dB  of  degradation  as  K  goes  from  5MN  to  2MN  as  shown  on  Figure  1.14 
for  M=N=5. 


1.2.6  Performance  Analyses  of  Two  Partially  Adaptive  STAP  Detectors 

Two  types  of  partially  adaptive  STAPs,  element-space  post-Doppler  STAP  and  beam-space  pre- 
Doppler  STAP,  are  discussed  in  this  section.  Both  of  these  criteria  have  the  advantage  of  reduced 
dimension  but  both  sacrifice  some  performance.  The  element-space  post-Doppler  STAP  uses  non- 
adaptive  temporal  filtering  first  and  then  performs  the  spatial  filtering  adaptively.  In  contrast,  the 
beam-space  pre-Doppler  STAP  performs  in  the  reverse  order,  i.e.  non-adaptive  spatial  filtering 
first,  followed  by  adaptive  temporal  filtering. 

CASE  1:  Element-Space  Post-Doppler  STAP 

Let  Ym  be  the  N  X 1  output  vector  of  the  Doppler  filter.  Then  a  snapshot  of  data  in  the 
presence  of  a  target  x  within  the  Doppler  filter  is  expressed  by 


ym  =  (fm  0  In)^x  =  (f,^  (g)  In)(s  +  n)  =  af^h{LJrn)  a('i?„) -b  (fm  0  lAr)^n  ,  (1.96) 


where  represents  a  Mxl  DFT  vector  for  the  Doppler  filter. 

Since  the  DFT  is  a  set  of  M  linear  operations,  one  has  rim  =  (fm  0  Iv)'^n  so  that  the  output 
noise  of  the  filter  is  a  Gaussian  random  vector.  The  Doppler  channel  output  noise  vectors 
for  m=l,2,---,M  are  statistically  independent  [38]  by  the  assumption  that  the  correlation 
structure  of  the  noise  is  much  less  than  the  duration  of  the  observations.  The  detection  test  with 
a  known  covariance  matrix  for  this  element-space  post-Doppler  STAP  is  given  by 


r  =  max< 
m,n  I 


|a^(-i9n)RmVmP  \ 


< 

H. 


Tmn 


(1.97) 
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where  is  the  Doppler  channel  output  noise  covariance.  The  maximum  of 

this  test  is  obtained  over  all  beams  for  each  Doppler  filter.  This  STAP  detector  suppresses  the 
interference  and  noise  by  adaptively  placing  nulls  in  the  angular  domain  only,  and  using  DFT 
filtering  in  the  Doppler  frequency  domain. 

The  ML  estimate  of  the  target  angle  for  a  given  Doppler  frequency,  is  obtained  by 

the  same  argument  used  in  the  previous  section  for  estimation.  That  is,  a{'dn)  is  that  value  of 
which  obtains  the  maximum. 


max 

a(T9n) 


a^{'dn)Rjym 


(1.98) 


Note  that  this  STAP  detector  provides  simultaneous  detection  and  ML  estimation  for  the  target 
angle  only. 

Similarly,  the  false-alarm  probability  for  the  single  Doppler  channel,  PpAmni  for  the 
Doppler-  filter  bank,  PpAi  are  given  respectively  by 

PpAmn  =  ,  (1.99) 


Pfa  =  l-(l-e-^'>)' 


(1.100) 


The  probability  of  detection  for  a  single-Doppler  channel,  Pomni  and  for  the  Doppler-filter  bank, 
Ff),  are  given,  respectively,  by 


Iq  (2\/pjn7i^mn)dr’„ 


(1.101) 


Fo  =  1- 


I 0  (2  6  ^) 


Th\MN-l 


where  the  generalized  signal-to-noise  ratio  pmn  is  given  by 

Pmn  =  |«fmb(w^)|^a^(??n)R;;^a('!?„)  . 


(1.102) 


(1.103) 


When  the  noise  covariance  R^  of  the  Doppler  filter  output  is  unknown,  R^  in  (1.97)  is 
replaced  by  its  ML  estimate  R,^!  which  is  given  by 


(1.104) 


where  yTO(^)  represents  the  observed  data  at  the  Doppler  filter  output  from  the  nearby 
range  gate.  Then,  the  false-alarm  probability  with  unknown  covariance  for  single-Doppler 
channel,  PpAmni  and  for  the  Doppler-filter  bank,  PpAi  are  given,  respectively,  by 


F(l, K-N+  1)B{N  -1J<  -  N  +  2){K  -  N+1) 

rl  /  t  \  — iV+1 

i  (irk?) 


(1  -  cr-^dc 


(1.105) 


Pfa  =  1 


rl  ^K-N+2(-i  _  (-■\N-2  \ 

— — -  .^k-n+2  dCdrmn)  (1.106) 


F(l,  K-N  +  l)BiN  -1,K-N  +  2)  Jo  (1  +  r^nC) 
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The  probability  of  detection  Pomn  unknown  covariance  for  a  single  Doppler  channel  and 
Pd  for  the  Doppler- filter  bank  are  given,  respectively,  by 


B{N  -1,K  ~  N  +  2) 


f  _  q1 

Jo 


oo  ~k 

ct'  {k  +  i,K-N  +  i)d(: 


(1.107) 


Pd  =  1 


5(1,  K-N  +  l)B{N  -l,K-N  +  2) 


cl  p-Pmn^A'-iV+2^  _ 


PmnC'^r 


1  /•!  _  ^^N~2 

5(1, K-N+  1)B{N  -1,K-N  +  2)  Jq 


MiV-l 


(1.108) 


CASE  2:  Beam-Space  Pre-Doppler  STAP 

Let  Zn  be  the  Mxl  output  vector  of  the  beam.  Then  a  snapshot  of  data  x  with  a  target 
present  and  z„  are  related  by 

Zn  =  (lM<8)g„)^x=  (lM®gf)(s  +  n)  =  ag^h{u:m)  a(i?„) -h  (Im  ®  g„)^n  ,  (1.109) 

where  gn  represents  a  Nxl  spatial  DFT  vector  for  the  beam. 

By  the  same  arguments  used  in  Case  1  the  final  detection  test  with  a  known  covariance  matrix 
for  the  beam-space  pre-Doppler  STAP  is  given  by 


r  —  max^ 


c;^  ^  mn  ? 


(1.110) 


where  represents  the  output  noise  covariance  of  the  beam.  In  contrast  to  Case  1  the  beam- 
sp'ace  pre-Doppler  STAP  suppresses  the  interference  by  placing  nulls  in  the  Doppler-frequency 
domain  and  in  the  angle  domain  by  using  fixed  (non-adaptive)  spatial  DFT  filtering  only. 

Similarly,  the  ML  estimate  b(a;^)  of  Doppler  frequency  is  that  value  of  b(cj7^)  which  obtains 
the  maximum. 


b^(u;„)R„^z„ 

max  <  — ; - ^ ^ - 

b(wm)  b^(a;m)Rm^b(u;m) 


(1.111) 


In  contrast  to  Case  1,  the  detector  for  this  case  provides  simultaneous  adaptive  detection  and 
ML  estimation  of  the  Doppler  frequency  only. 

All  of  the  performance  results  of  the  test  for  the  beam-space  pre-Doppler  STAP  are  obtained 

by  simply  exchanging  N  with  M  and  substituting  pmn  =  ®S^^(^n)  for  Pmn 

in  the  corresponding  generalized  signal-to-noise  ratio  for  the  element-space  post-Doppler  STAP. 

The  probability  of  detection  of  these  partially  adaptive  STAP’s  as  a  function  of  the  gener¬ 
alized  signal-to-noise  ratio  pmni  is  shown  in  Figures  1.15  and  1.16  for  M=N=10  and  M=N=5, 
respectively.  The  effect  of  sample  size  K  on  the  performance  is  presented  in  theses  figures. 
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Performance  comparisons  between  fully  and  partially  adaptive  STAP  are  made  in  Figures  1.17 
and  1.18  for  M=N=10  and  M=N=5.  The  performance  difference  between  partially  and  fully 
adaptive  STAP  is  not  large  for  a  sample  size  K  which  is  five  times  the  number  of  degrees  of 
freedom  for  M=N=10  and  M=N=5.  With  sample  size  K  which  is  twice  the  number  of  degrees 
of  freedom  the  performance  degradation  of  partial  STAP  with  respect  to  full  STAP  processing 
is  about  5  dB  for  M=N=10  and  about  7  dB  for  M=N=5.  Interestingly,  the  performance  of  the 
fully  adaptive  STAP  with  M==N=:5  and  K=5MN  is  about  the  same  as  that  of  partially  adaptive 
STAP  with  M— N=5  and  K=5N. 

1.2.7  Conclusion 

A  simultaneous  CFAR  detection  and  ML  estimation  algorithm  is  developed  for  both  fully  and 
partially  adaptive  STAP  in  this  study.  The  performance  of  these  CFAR  tests  are  analyzed  for 
a  known  and  a  maximum  likelihood  estimated  covariance  matrix.  The  performance  curves  of 
the  probability  of  detection  show  that  the  performance  is  a  function  of  the  sample  size  K  used 
to  estimate  the  covariance.  The  choice  of  the  number  K  is  a  tradeoff  between  performance  and 
computational  complexity.  The  optimum  number  is  between  two  and  five  for  K.  The  performance 
curves  also  show  that  the  finer  the  resolution  is,  the  worse  the  detection  performance  is  expected 
to  be. 

1.3  A  Fast  CFAR  Detection  Space-Time  Adaptive  Processing 
Algorithm 

All  of  the  conventional  CFAR  detection  algorithms  which  use  space-time  processing  involve  a 
time-consuming  matrix  inversion  operation.  Based  on  today’s  technology,  this  computational 
complexity  sometimes  makes  the  full-rank  solution  unrealizable.  In  this  section,  a  CFAR  detection 
algorithm,  which  does  not  need  a  matrix  inversion,  is  developed  by  an  adaptation  and  extension  of 
Hotelling’s  principal  component  method  studied  recently  [13]  by  Kirsteins  and  Tufts.  If  also  the 
HT  method  [30]  is  incorporated  into  this  new  algorithm,  the  computational  complexity  is  reduced 
to  0(N)  from  O(N^)  for  the  matrix  inversion  used  in  the  conventional  adaptive  algorithm. 

1.3.1  Introduction 

Future  airborne  radar  are  expected  to  be  required  to  provide  long-range  detection  of  targets  with 
increasingly  smaller  cross-sections.  This  function  must  be  performed  in  environments  where  the 
clutter  can  be  quite  severe.  Space-Time  Adaptive  Processing  (STAP)  refers  to  multidimensional 
adaptive  filtering  algorithms  that  simultaneously  combine  the  signals  from  the  elements  of  an 
array  antenna  and  the  multiple  pulses  of  a  coherent  radar  waveform  to  suppress  interference  and 
provide  target  detection. 

The  first  published  work  on  STAP  for  radar  was  by  Brennan  and  Reed  [17]  in  1973,  in  which 
optimum  space-time  filtering  was  described.  Later  in  1986  a  more  all-encompassing  generalized 
likelihood  ratio  test  (GLRT)  on  STAP  for  radar  detection  was  derived  by  Kelly  [24].  Kelly  used 
the  sample  matrix  inverse  (SMI)  technique  of  Reed,  Brennan  and  Mallett  in  [25].  Recently  Chen 
and  Reed  [26]  developed  a  simpler  CFAR  detection  test  than  the  CLRT  obtained  by  Kelly.  Robey, 
Fuhrmann,  Kelly  and  Nitzberg  [27]  also  developed  independently  the  same  result  as  Chen  and 
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Reed.  Although  this  new  CFAR  test  showed  a  slightly  degraded  performance  ,  it  was  easier  to 
analyze  than  the  GLRT.  However,  all  of  the  previous  STAR  algorithms  involved  a  time-consuming 
matrix  inversion  operation. 

The  principal  components  (PC)  technique  was  developed  originally  by  Hotelling  [28].  Hotelling 
showed  that  the  dimension  of  the  problem  often  could  be  reduced  without  sacrificing  too  much 
of  the  information  contained  in  the  covariance  matrix.  Muirhead’s  book  [29]  is  an  excellent  refer¬ 
ence  for  the  PC  method.  A  new  unnormalized  CLRT  which  uses  the  PC  technique  is  developed 
by  Kirsteins  and  Tufts  in  [13],  but  they  do  not  obtain  the  probability  density  functions  (PDF) 
needed  to  evaluate  the  performance  of  the  test  statistic. 

In  this  study  the  normalized  LRT  in  [26]  and  [27]  is  modified  by  a  use  of  the  PC  technique. 
This  test  exhibits  the  very  desirable  property  that  no  matrix  inversion  is  needed  and  that  the 
computation  complexity  can  be  reduced  to  0(N)  by  incorporating  the  HT  method  found  recently 
by  Hung  and  Turner  in  [30]. 

The  PDF  of  this  new  detection  statistic  is  derived  here  for  both  the  noise-alone  case  and 
the  signal-plus-noise  case.  In  the  noise-alone  case,  the  PDF  proves  to  be  very  simple  and  the 
probability  of  a  false-alarm  (PFA)  is  shown  to  depend  only  on  the  number  of  parameters  and  the 
sample  data  size.  Thus  this  new  PC  test  is  a  CFAR  criterion. 

1.3.2  Derivation  of  the  Test  Function 

The  system  under  consideration  is  a  coherent  pulsed  Doppler  radar.  For  simplicity  the  radar 
antenna  is  assumed  to  be  a  uniformly  spaced  linear  array  antenna,  consisting  of  A'a  elements. 
The  radar  transmits  a  coherent  burst  of  Np  pulses.  For  each  pulse  repetition  interval  (  PRI  )  Nr 
range  samples  are  collected.  With  Np  pulses  and  Na  receiver  channels,  the  received  data  for  one 
coherent-processing  interval  (  CPI  )  comprises  NrNpNa  complex  baseband  samples,  the  so-called 
NrNpNa  data  cube. 

It  is  assumed  in  the  present  treatment  that  a  target  remains  in  only  one  range  gate  during 
a  CPI.  The  snapshot  of  the  N  =  NpNa  samples  of  data  for  the  range  gate,  which  is  assumed  to 
contain  the  target  signal,  is  called  the  primary  data.  The  secondary  data  consists  of  the  outputs 
of  L  range  gates  in  the  vicinity  of  the  range  gate  of  the  target  to  be  detected. 

The  noise  components  of  the  data  vectors  consist  of  thermal  noise,  jamming  or  clutter,  which 
are  modeled  as  zero-mean,  complex  Caussian,  random  noise  vectors.  The  noise  components  of 
each  range  gate  are  assumed  to  share  a  common  covariance  matrix  and  to  be  mutually  indepen¬ 
dent. 

Let  X  be  a  Nxl  snapshot  of  a  given  range  gate,  i.e.  the  observation  vector, 

x=[xi,X2,---iXn]'^  •  (1-112) 

Two  hypotheses  are  postulated  : 

Ho:  X  =  n  ,  (1.113) 

Hi:  X  =  s  +  n  ,  (1-114) 

where  n  is  a  N  x  1  Caussian  noise-plus-clutter  vector  with  known  covariance  matrix  R  and  s  is 
a  known  Nxl  target  vector  with  a  complex  amplitude  a  and  a  random  phase  (f>.  Thus  s  can  be 
represented  by  the  N-dimensional  vector,  s  =  ov,  where  v  is  a  Nxl  steering  vector.  Then  the 
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usual  decision  rule,  which  is  based  on  the  matched  filter  for  incoherent  detection  (see  [26]),  is 
given  by 


r=  v^R"^x  ^  To. 


(1.115) 


A  normalization  with  respect  to  the  output  noise  power  yields  a  group-invariant  test  with  respect 
to  the  groups  of  change-of-scale  and  starting  phase  as  follows: 


v^R-^x  Hi 
v^^R-iy  ^ 


>  rp' 

<  ^0  ■ 


(1.116) 


This  test  is  independent  of  the  scale  of  the  assumed  known  covariance  matrix  R.  The  generalized 
output  signal-to-noise  ratio,  associated  with  the  detection  test  in  (1.116)  is  given  by 


E{r\x  =  Qfv} 


(1.117) 


S{r|x  =  n}  i“i  •  “  >  V  / 

where  E{}  denotes  the  expectation-value  operator. 

1.3.3  Rank  Reduction  of  the  Clutter-Plus-Noise  Covariance 

The  NxN  covariance  matrix  R  of  the  background  clutter  and  noise  can  be  modeled  as 

R  =  -f  (1.118) 

where  represents  the  covariance  of  the  thermal  noise  and  equals  the  covariance  of 

the  interference.  Here  U  is  the  NxN  eigenvector  matrix,  and  #  is  the  eigenvalue  matrix  of  R 
with  the  eigenvalues  Ai  >  A2  >  ...  >  A;y  >  0.  The  following  matrix  inversion  formula,  verified  in 
the  Appendix  D,  namely, 


(A  +  BCD)"^  =  A"^  -  A-^B  (dA-^B  +  C"^)  ^  DA'S 
readily  yields  the  inverse  of  the  noise  covariance  in  the  following  form: 

R-^  = 

=  In-V  + 

=  \1n  -  UAU^]  , 

where 


(1.119) 


(1.120) 


(1.121) 
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Note  that  +  a^)  =  I  if  Xi  >  and  Xi/{Xi  +  cr^)  =  Ai/a^  if  Xi  <  From  these 

observations,  noted  first  by  Kirsteins  and  Tufts,  R“^  can  be  approximated  by  a  reduced  rank 
matrix,  denoted  by  using  the  principal  component  method.  The  reduced  rank  k  of  R“^  is 

selected  in  accordance  with  the  condition, 

Xk  >  and  <  ta^, 


where  t  is  a  specified  number  greater  than  1.  According  to  the  above  selected  value  of  k,  the 
matrices  A  and  U  are  partitioned  as  follows: 


Afc  0 


0  A,  ’ 


U  =  [U;i,U,], 


(1.122) 


where  Ak  is  a  kxk  diagonal  matrix,  A^  is  a  (N-k)x(N-k)  diagonal  matrix,  Ua,  is  a  Nxk  matrix 
and  Ur  is  a  Nx(N-k)  matrix.  In  terms  of  the  partitions  of  matrices  A  and  U,  the  covariance 
matrix  inverse  R”^  in  (1.120)  becomes 


R-l  =  [I;v  -  UfeA^Uf  ]  -  a-^UrArUf , 


(1.123) 


where  Ak  —  Ik 


A  —  — 


XkJ{.i  0  ...  0 

0  Xk-[^2  •••  0 


(1.124) 


0  0  ...  Xn 

Now,  in  the  principal  component  method  R”^  is  approximated  by  the  matrix  R^^,  given  by 


r\j 


liv  -  U^AfeUf  ■ 


(7-2  [l^  -  U^Uf  ] 

a-^Pk, 


+  0{a-^) 


(1.125) 


where  O(-)  denotes,  “order  of”,  A^  =  h  and  represents  the  projection 

operator  of  the  subspace  which  is  orthogonal  to  the  principal  noise  component  subspace  with 
projection  operator  UfcU^^. 


Appendix  A:  Householder  Transformation 

Let  z  =  [zi,Z2,...,ZnV  be  a  complex  MN  x  1  vector,  ri  =  z^z,  ei  =  [1,0,  ...,0]^  and  v  = 
z  -p  where  6  =  and  zi  =  Xi  +  jyi-  The  Housholder  transformation  matrix  H  is 

given  by 


H  =  In- 


2vv^ 


(1.126) 
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Then  H-^H  =  In  and  Hz  = 

Proof 

The  first  equality  is  immediate.  Next 


Hz  = 


2vv^z 

z - 77 —  =  z  -  (z  +  T/eie 


2(z  +  Tjeie^^)^z 


■yJi  Y 


—  2  _  (2  +  Tfeie^^) 


(z  +  {z  +  7/eie-?^) 

2(?7^  +  Tyei^ze"-^^) 


rf'  +  r]e~^^ei^z  +  T/e+'^^z^ei  + 


r  ,  (?/  +  ei^ze 


z  —  (z  +  Tyeie-^^) 


■g  r]  +  {xi  cos  0  +  yi  sin  6)  +  j {yi  cos  9  -  Xi  sin  6) 


T]  +  {xi  cos  6  +  yi  sin  6) 


(1.127) 


since  y\  cos  0  —  a;i  sin  0  - „  =  0.  QED 

V^i+2/r  v®i+!/i 

Appendix  B:  Derivation  of  the  pdf  of  r}~^ 

The  pdf  of  77“^  =  z^zi  —  z^Z(Z^Z)~^Z^zi  is  derived  in  this  appendix  by  a  use  of  the  Cochran’s 
theorem  [37]. 

Cochran’s  Theorem:  Assume  that  x  is  a  n  X  1  zero  mean,  white  Gaussian,  random  vector, 
i.e.  1V„(0,I„)  and  C  is  a  symmetric  matrix.  Then 

x^Cx  =  xl 

a 

if  and  only  if 


=  C  and  rank  (C)  =  k. 

Next,  7)'^^  can  be  arranged  to  be  in  the  quadratic  form  x^Cx  as  follows: 

77-^  =  zfzi -zfZ(Z^Z)"^Z^zi 

=  zf  (l- Z(Z^Z)~^Z^)  zi 

=  zf(I-P)zi  ,  (1.128) 

where 

P  =  Z(Z^Z)'’^Z^  (1.129) 

is  evidently  a  projection  matrix  with  rank  equal  to  N-1.  Note  also  that  (I  —  P)  is  also  a  projection 

matrix  of  rank  {K  —  A^+  1).  Since  the  entries  of  Zj  are  i.i.d.  N(0,1)  and  (I  —  P)^  =  (I  —  P),  one 

has  by  Cochran’s  theorem  that 


zf(I-P)zi=  Xk-n+v 


(1.130) 
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Appendix  C:  Derivation  of  the  pdf  of  ( 

The  pdf  of  C  is  derived  in  this  appendix.  One  can  also  find  a  similar  derivation  in  [5].  By  (1.84) 

C=  (l  +  zfZ(Z^Z)"^Z^zi)”^  .  (1.131) 

Next  let 

(5  =  zfZ(Z^Z)"^Z%.  (1.132) 

Since  Z  is  positive  definite,  one  can  find  the  QR  decomposition  of  Z  by  the  Householder  trans¬ 
formation  such  that 


Z  —  Q^Rz 


(1.133) 


where  Qz  is  a  K  X  (N-1)  matrix  with  Q^Qz  —  liV-i  Rz  is  a  (N-1)  x  (N-1)  lower  triangular 
matrix.  The  QR  decomposition  is  performed  by  successively  applying  the  Householder  transfor¬ 
mation  to  each  row  of  the  matrix  Qz-  From  the  top  row  to  the  bottom  row,  one  row  at  a  time, 
one  transforms  the  matrix  Z.  After  the  (N-l)-st  transformation,  one  obtains 


ZQ1Q2 .  •  •  Qn-1  —  Rz  = 


ZA'lle-^^'  0 

*  ||zA'_i||e'^^2  0 


0 

0 


,(1.134) 


* 


|zA'_N+2||e^'^^™^ 


where  Qi  is  the  Householder  transformation  matrix  with  respect  to  the  row  of  length  K-i+1 
and  ||zj||  represents  the  vector  norm  of  length  j. 

By  a  substitution  of  Z  ==  QzRz  (1.132),  5  becomes 

5  =  zfQzRz(RfRz)"'RfQfzi 

=  zfR^^R^^zi  ,  (1.135) 


where  zi  =  Qz^i-  Next,  apply  the  Householder  transform  by  picking  a  unitary  matrix  Q,  such 
that 

Q,zi  =  .  (1.136) 


Then,  one  obtains  the  following  result: 

S  ~  z^^QgR^'^R^’^QgZi 

a 

=  zf  Zie^_i  (r^^R^^)  Zi 

Xn-1 

d  Xk-N+2 


(1.137) 
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A-1  -  A“^B  (dA-^B  +  C-i)  DA-iJ  (A  +  BCD) 

=  I  -  A-'B  (dA-^B  +  C-i)'"  D  -  A-^B  (dA-^B  +  C"')”'  DA-^BCD  +  A-^BCD 
=  I  -  A-^B  (dA-'B  +  C-^CD  +  A-^B  (dA'^B  +  C"!)"'  DA-^BCD  +  A-^BCD 

=  I- A-iB  [(dA-1B  +  C-i)“'c-i+ (dA-1B  +  C-i)"'dA-1B-I  CD 

=  i-a-ib[(da-ib  +  c-i)“'(da-ib  +  c-')-i]cd  =  i  QED 
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Figure  1.10:  Fully  adaptive  STAP. 
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Figure  1.11:  Element-space,  post-doppler  STAR 
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Figure  2  System  Block  Diagram  of  Element  -  Space  Post-  Doppler  STAP  Estimator-  Detector 
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Figure  1.12:  Beam-space,  post-doppler  partial  STAP. 
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Detection  probability  versus  SNR 


Figure  1.13:  Probability  of  detection  of  fully  adaptive  STAP  (M=10,N=10) 


SNR(dB) 


Figure  1.14:  Probability  of  detection  of  fully  adaptive  STAP  (M=5,N=5). 
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Detection  probability  versus  SNR 


Figure  1.15:  Probability  of  detection  of  partially  adaptive  STAP  (M“10,N=10). 


Detection  probability  versus  SNR 


Figure  1.16:  Probability  of  detection  of  partially  adaptive  STAP  (M=5,N=5). 
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Filter  bank  detection  probability  versus  SNR 


Figure  1.17:  Comparison  of  the  probability  of  detection  between  fully  and  partially  adaptive 
STAP  (M=10,N=10) 


Filter  bank  detection  probability  versus  SNR 


Figure  1.18:  Comparison  of  the  probability  of  detection  between  fully  and  partially  adaptive 
STAP  (M=5,N=5) 
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Part  2 


Robust  Detection  and  Estimation 
with  Alpha-Stable  Distributions  for 
STAP  Applications 


Principal  Investigator:  Professor  Chrysostomos  L.  Nikias 

Collaborators:  Russell  Lambert,  Xinyu  (Jack)  Ma,  Dr.  Panagiotis  Tsakalides, 
and  Professor  George  A.  Tsihrintzis 


Spikes  due  to  clutter  sources  such  as  mountains,  forests  or  ocean  waves,  and  glints  due  to  reflec¬ 
tions  from  large  flat  surfaces  such  as  buildings  or  vehicles  are  usually  present  in  radar  returns. 
Their  presence  obscures  the  target  detection  capability  of  a  radar  system  and  degrades  the  sys¬ 
tem’s  performance.  Preprocessing  to  suppress  the  spikes/glints  in  the  radar  returns  and  minimize 
their  effect  on  the  radar  target  detection  performance  can  be  efficiently  done  only  on  the  basis  of 
appropriate  statistical  modeling. 

Very  recently,  a  statistical  model  of  impulsive  interference  has  been  proposed,  which  is  based 
on  the  theory  of  symmetric  alpha-stable  (SaS)  random  processes  [1].  The  model  is  of  a  statistical- 
physical  nature  and  has  been  shown  to  arise  under  very  general  assumptions  and  to  describe  a 
broad  class  of  impulsive  interference.  In  particular,  it  has  been  shown  in  [1]  that  the  first  order 
distribution  of  the  amplitude  of  the  radar  return  follows  a  SaS  law,  while  the  first-order  joint 
distribution  of  the  quadrature  components  of  the  envelope  of  the  radar  return  follows  an  isotropic 
stable  law.  This  model  is  mathematically  more  appealing  than  existing  models  for  impulsive 
interference. 

In  this  part,  we  address  the  solution  of  the  target  detection  and  signal  parameter  estimation 
problems  through  the  use  of  radar  array  sensor  data  retrieved  in  the  presence  of  impulsive  in¬ 
terference  (noise,  clutter,  or  jamming.)  First,  we  describe  new  methods  on  the  modeling  of  the 
amplitude  statistics  of  airborne  radar  clutter  by  means  of  alpha-stable  distributions.  Then,  we 
introduce  target  detection  and  joint  angle  and  Doppler  estimation  techniques  from  radar  measure¬ 
ments  retrieved  in  the  presence  of  impulsive  noise  modeled  as  a  multivariate  sub-Gaussian  ran¬ 
dom  process.  The  results  are  of  great  importance  in  the  study  of  space-time  adaptive  processing 
(STAP)  for  airborne  pulse  Doppler  radar  arrays  operating  in  impulsive  interference  environments. 
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This  part  includes  the  results  in  the  area  of  alpha-stable  modeling  and  is  organized  as  follows: 

1.  Introduction  of  the  statistical  model,  based  on  the  class  of  symmetric  ce-stable  {SaS)  dis¬ 
tributions  [Section  2.1]; 

2.  Characterization  and  development  of  methods  for  parameter  estimation  of  SaS  distri¬ 
butions  by  means  of  order  statistics,  fractional  lower-order  and  negative-order  moments. 
Modeling  of  the  amplitude  statistics  of  radar  clutter  by  means  of  SaS  distributions  and 
estimation  of  the  parameters  of  the  stable  distributions  from  real  clutter  of  the  Mountain 
Top  Database  [Section  2.2]; 

3.  Development  of  an  adaptive  matched-filter  detector  for  the  case  of  alpha-stable,  sub- 
Gaussian  noise.  Comparison  of  the  performance  of  the  proposed  detector  to  the  that  of 
the  optimum  detector  under  completely  known  signal  and  Gaussian  noise  characteristics 

[Section  2.3]; 

4.  Development  of  the  Cauchy  Beamformer  for  joint  spatial-  and  Doppler-frequency  maximum 
likelihood  estimation.  The  Cauchy  Beamformer  is  shown  to  be  very  robust  in  a  wide  range 
of  additive  impulsive  noise  environments.  [Section  2.4.1]; 

5.  Development  of  a  new  joint  spatial-  and  doppler-frequency  high-resolution  estimation  tech¬ 
nique  based  on  the  eigendecomposition  of  the  covariation  matrix  of  the  space-time  radar 
measurements  [Section  2.4.2]. 

2.1  Alpha-Stable  Random  Variables  and  Processes 

Gaussian  distributions  and  processes  have  long  been  accepted  as  useful  tools  for  stochastic  mod¬ 
eling.  In  this  section,  we  introduce  a  statistical  model  based  on  the  class  of  symmetric  a-stable 
(SaS )  distributions  which  is  well-suited  for  describing  signals  that  are  impulsive  in  nature.  A 
review  of  the  state  of  the  art  on  stable  processes  from  a  statistical  point  of  view  is  provided  by 
a  collection  of  papers  edited  by  Cambanis,  Samorodnitsky  and  Taqqu  [2].  Several  statisticians 
including  Cambanis,  Zolotarev,  Weron,  et  al.  have  published  extensively  on  the  theory  and  appli¬ 
cations  of  stable  processes.  They  studied  the  properties  of  stable  processes  [3,  4,  5],  their  spectral 
representation  [6,  7],  as  well  as  prediction  and  linear  filtering  problems  [8,  9].  Textbooks  in  the 
area  were  written  by  Samorodnitsky  and  Taqqu  [10]  and  by  Janicki  and  Weron  [11]-  An  extensive 
review  of  stable  processes  from  a  signal  processing  point  of  view  can  be  found  in  a  tutorial  paper 
by  Shao  and  Nikias  [12]  as  well  as  in  a  monogram  written  by  the  same  authors  [1]. 

2.1.1  The  Class  of  Real  SaS  Distributions 

The  symmetric  cr-stable  [SaS)  distribution  is  best  defined  by  its  characteristic  function 

=  exp(y(5c<;  -  7la;|“),  (2.1) 

where  a  is  the  characteristic  exponent  restricted  to  the  values  0  <  o  <  2,  5  (-oo  <  S  <  oo)  is 
the  location  parameter^  and  7  (7  >  0)  is  the  dispersion  of  the  distribution.  For  values  of  a  in  the 
interval  (1, 2],  the  location  parameter  6  corresponds  to  the  mean  of  the  SaS  distribution,  while  for 
0  <  O'  <  1,  <5  corresponds  to  its  median.  The  dispersion  parameter  7  determines  the  spread  of  the 
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distribution  around  its  location  parameter  similar  to  the  variance  of  the  Gaussian  distribution. 
The  characteristic  exponent  a  is  the  most  important  parameter  of  the  SaS  distribution  and  it 
determines  the  shape  of  the  distribution. 

A  stable  distribution  is  called  standard  if  ^  =  0  and  7  =r  1.  Clearly,  if  a  random  variable  X 
is  stable  with  parameters  a,  7,  5,  then  {X  -  is  standard  with  characteristic  exponent  a. 

The  standard  SaS  density  functions  for  a  few  values  of  the  characteristic  exponent  a  are  shown 
in  Figure  2.1. 

By  letting  a  take  the  values  1  and  2,  we  get  two  important  special  cases  of  SaS  distributions, 
namely,  the  Cauchy  {a  =  1),  and  the  Gaussian  (a  =  2): 

Cauchy 

-^2+(^_5)2  (2.2) 

Gaussian  ^ 

•^2  (7.  ^  exp  [-  (2.3) 

Unfortunately,  no  closed  form  expressions  exist  for  general  SaS  distributions  other  than  the 
Cauchy  and  the  Gaussian.  However,  power  series  expansions  can  be  derived  for  fa{j^S;x)  [12]. 
Although  the  SaS  density  behaves  approximately  like  a  Gaussian  density  near  the  origin,  its 
tails  decay  at  a  lower  rate  than  the  Gaussian  density  tails.  While  the  Gaussian  density  has 
exponential  tails,  the  stable  densities  have  algebraic  tails  (cf.  Figure  2.2).  The  smaller  the 
characteristic  exponent  a  is^  the  heavier  the  tails  of  the  SaS  density.  This  implies  that  random 
variables  following  SaS  distributions  with  small  characteristic  exponents  are  highly  impulsive.  It 
is  this  heavy-tail  characteristic  that  makes  the  SaS  densities  appropriate  for  modeling  signals 
and  noise  or  interference  which  are  impulsive  in  nature. 

SaS  densities  obey  two  important  properties  which  further  justify  their  role  in  data  modeling: 

•  The  stability  property^  which  states  that  the  random  variables  Xi, . . . ,  Xn  are  independent 

and  symmetrically  stable  with  the  same  characteristic  exponent  a  if  and  only  if  for  any 
constants  ai, . . .,  a^^  the  linear  combination  is  also  SaS] 

•  The  generalized  central  limit  theorem,,  which  states  that  the  family  of  stable  distributions 
contains  all  limiting  distributions  of  sums  of  i.i.d.  random  variables. 

An  important  difference  between  the  Gaussian  and  the  other  distributions  of  the  5Qf5  family 
is  that  only  moments  of  order  less  than  a  exist  for  the  non-Gaussian  SaS  family  members.  The 
fractional  lower  order  moments  (FLOM’s)  of  a  SaS  random  variable  with  zero  location  parameter 
and  dispersion  7  are  given  by: 

E\X\P  =  C{p,  a)  A  for  0  <  p  <  a  (2.4) 


where 

C(p,  a) 

and  r(-)  is  the  Gamma  function. 


2P+ir(^)r(-g) 

“v^r(-|) 


(2.5) 
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2.1.2  Symmetric  Sub-Gaussian  Alpha-Stable  Processes 

A  collection  of  random  variables  {X{t),  t  €  T}  where  T  is  an  arbitrary  index  set,  is  said  to 
be  a  SaS  stochastic  process  if  for  all  combinations  of  distinct  indices  G  T,  the  ran¬ 
dom  variables  X{ti), . .  are  jointly  SaS  with  the  same  characteristic  exponent  a.  The 

stochastic  process  {X{t),  t  G  T}  is  stationary  if  the  random  vectors  (A(ii), . .  .,X(t„))  and 
(A(ti  +  s), . . . ,  A {tn  +  s))  are  identically  distributed  for  each  choice  of  s,  ti, . . . ,  G  T.  The  fam¬ 
ily  of  stable  processes  has  many  members  with  mutually  exclusive  properties.  In  the  following, 
we  present  the  important  class  of  sub-Gaussian  stable  processes  which  we  will  use  extensively. 


Sub-Gaussian  Processes 


A  stable  process  {A(t),  t  £  T}  is  said  to  be  an  a-sub-Gaussian  process  (q!-SG(£))  if  for  distinct 
indices  ti, . . . ,  G  T,  the  random  vector  A  =  [A (ti), . . . ,  A {tn)Y  has  characteristic  function  of 
the  general  form 

=  exp[-i(w^^)"/^],  (2.6) 

where  £  is  a  positive-definite  matrix,  called  the  underlying  matrix  of  the  process,  w  =  [wi , . . . , 
and  oTakes  values  in  (1,2].  When  a  =  2,  A  is  a  Gaussian  vector  with  zero  mean  and  covari¬ 
ance  matrix  R.  Unfortunately,  closed-form  expressions  for  the  joint  pdf  of  sub-Gaussian  random 
vectors  are  known  only  for  the  Gaussian  (a  —  2)  and  Cauchy(Q;  =  1)  cases: 


fG{X) 


fc{X) 


1 


exp{-X^R-^X) 


[1  + A^E"^A](^+i)/2 


(Gaussian) 


(Cauchy) , 


(2.7) 

(2.8) 


where  L  is  the  length  of  the  random  vector,  ||^||  is  the  determinant  of  and  c  =  i2^  {^^)  ■ 

Sub-Gaussian  processes  share  many  common  features  with  Gaussian  processes.  In  fact, 
sub-Gaussian  processes  are  variance  mixtures  of  Gaussian  processes  [13].  Specifically,  any  sub- 
Gaussian  random  vector  can  be  expressed  in  the  form 


A  = 


(2.9) 


where  -u;  is  a  positive  |-stable  random  variable  [14]  and  G  is  a  Gaussian  random  vector  of  mean 
zero  and  covariance  matrix  R.  An  important  distinction  between  Gaussian  and  sub-Gaussian 
processes  is  that,  while  linear  spaces  of  Gaussian  random  variables  may  contain  non-degenerate 
independent  elements,  sub-Gaussian  random  variables  cannot  be  independent  [15]. 

Sub-Gaussian  SoS  processes  combine  the  capability  to  model  statistical  dependence  with  the 
capability  to  model  the  presence  of  outliers  in  observed  time  series  of  various  degrees  of  severity. 
The  example  in  Figure  2.3  is  indicative  of  the  concept.  Consider  a  sub-Gaussian  vector  of  length 
L  =  100  and  diagonal  underlying  covariance  matrix  R  =  diag  {1,1,...,!}.  Typical  realizations 
of  the  vector  are  shown  for  characteristic  exponents  a  —  2  and  a  =  1.5.  Clearly,  it  is  difficult 
to  distinguish  one  vector  from  the  other  visually.  However,  if  we  look  over  1000  independent 
realizations  of  the  first  component  of  the  vector,  a  clear  difference  is  observed. 

In  modeling  the  signals  and/or  noise  for  the  parameter  estimation  problem,  we  need  a  complex 
model  for  the  noise  samples.  We  also  need  to  define  quantities  which  describe  correlations  between 
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random  variables.  In  the  following  section,  we  present  the  family  of  isotropic  complex  SaS 
distributions  and  describe  their  correlation  properties  by  means  of  the  covariation  quantity. 


2,1.3  Complex  SaS  Random  Variables  and  Covariations 

A  complex  random  variable  (r.v.)  X  =  Xi  +  JX2  is  symmetric  a-stable  (SaS)  if  Xi  and  X2  are 
jointly  SaS  and  its  characteristic  function  is  written  as 


(p{oj) 


=  Z?exp[j9?(a;X*)]  =  E  exY>[j{oJiXi  +0J2X2)] 
=  exp  [-  /  \uJiXi+U2X2\°‘dTxi,X2{^i^^2)  , 


(2.10) 


where  u>  =  ui  +JUJ21  3^[-]  is  the  real  part  operator,  and  Vxi,X2  is  a  symmetric  measure  on  the  unit 
sphere  S2,  called  the  spectral  measure  of  the  random  variable  X.  A  complex  random  variable 
X  =  Xi  +  JX2  is  isotropic  if  and  only  if  (Xi,  ^2)  has  a  uniform  spectral  measure.  In  this  case, 
the  characteristic  function  of  X  can  be  written  as 


(p{uj)  =  £^exp(y3?[wX*])  =  exp(-7|a)|"),  (2.11) 

where  7  (7  >  0)  is  the  dispersion  of  the  distribution. 

In  the  theory  of  second-order  processes,  the  concept  of  covariance  plays  an  important  role  in 
problems  of  linear  prediction,  filtering  and  smoothing.  Since  SaS  processes  do  not  possess  finite 
pth  order  moments  for  p  >  cv,  covariances  do  not  exist  on  the  space  of  SaS  random  variables. 
Instead,  a  quantity  called  covariation  plays  an  analogous  role  for  statistical  signal  processing 
problems  involving  SaS  processes  to  the  role  played  by  covariance  in  the  case  of  second-order 
processes. 

Several  complex  r.v.’s  are  jointly  SaS  if  their  real  and  imaginary  parts  are  jointly  SaS,  When 
X  =  Xi+  JX2  and  Y  =  JY2  are  jointly  SaS  with  1  <  a  <  2,  the  covariation  of  X  and  Y  is 
defined  by 

[X,Y]a=  f  {xi  +  jX2){yi+  jy2)"^°‘~^^dTxuX2,YuY2i^i^^2,yi,y2),  (2.12) 

where  we  use  throughout  the  convention 

y</3>  ^  |y|/5-iy*.  (2.13) 


It  can  be  shown  that  for  every  1  <  p  <  a,  the  covariation  can  be  expressed  as  a  function  of 
moments  [8] 

p  YY<p-l> 

[X.y]a=  TT.  (2.14) 


E|y|i 


where  yy  is  the  dispersion  of  the  r.v.  Y  given  by 


E\Y\- 

*'  C(p,c,) 


for  0  <  p  <  O', 


(2.15) 


with 


C{p,  a) 


2P+ir(H^)r(-^) 

«r(i)r(-|) 


(2.16) 
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Obviously,  from  (2.14)  it  holds  that 

Also,  the  covariation  coefficient  of  X  and  Y  is  defined  by 

.  IX,Y]„ 

^X,Y  = 


and  by  using  (2.14),  it  can  be  expressed  as 

EXY<P-^> 


^X,Y  = 


E\Y\P 


for  1  <  p  <  O'. 


(2.17) 


(2.18) 


(2.19) 


The  covariation  of  complex  jointly  SaS  r.v.’s  is  not  generally  symmetric  and  has  the  following 
properties  [3]: 

PI  If  Xi,  X2  and  Y  are  jointly  SaS,  then 

[aA'i  +  bX2,  F]a  =  a[Xi,Y]c  +  b[X2,  Tja  (2-20) 

for  any  complex  constants  a  and  b. 

P2  If  Yi  and  Y2  are  independent  and  X\,  X2  and  Y  are  jointly  SaS,  then 

[aXi,bYi  +  cYaja  =  ab<°-^>[XuYi]a  +  ac<“-i>  [Xi,  Yaj^  (2.21) 


for  any  complex  constants  a,  b  and  c. 

P3  If  X  and  Y  are  independent  SaS,  then  [X,Y]a  =  0 

P4  Let  {Uf,  i  =  l,...,n}  be  independent  complex  SaS  r.v.’s  with  dispersions  7^.  For  any 
complex  numbers  {a,-,  i=  1,. .  .,n}  form 

X  =  flil/i  +  ■  ■  •  +  ttnUn, 

Y  =  biUi  +  ---  +  bnUn. 

Then 

[X,X]^  =  7i|air  +  ---  +  7n|«nr, 

[Y,Y]«  =  7i|6il“  +  ---  +  7n|M“- 

[X,  Y]„  =  jiaib<^-^>  +  •  •  •  +  7„a„6<“-i>.  (2.22) 

2.1.4  Generation  of  Complex  Isotropic  SaS  Random  Variables 

The  generation  of  complex  isotropic  SaS  deviates  of  characteristic  exponent  a  is  based  on  the 
following  proposition  found  in  [10]: 

Proposition  2.1  A  complex  SaS  (a  <  2)  random  variable  X  =  Xi  JX2  is  isotropic  if  and 
only  if  there  exist  two  i.i.d.  zero-mean  Gaussian  variables  Gi  and  G2  o,P^d  a  real  stable  random 
variable  A  of  characteristic  exponent  a/2,  dispersion  cos^(7ra/4)  and  skewness  /?  =  1  (we  write 

A  ~  5a/2(cos^(7rQ;/4),  l)j,  independent  of  {Gi,G2)  such  that  {Xi,X2)  =  (A^/^Gi,  A^/^G2). 
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We  say  that  the  vector  (Xj,  X2)  is  sub-Gaussian  with  underlying  vector  (C?i,  G2).  It  can  be  shown 
that  the  real  and  imaginary  parts  of  X  are  always  dependent,  unless  Gi  and  G2  are  degenerate. 
Hence,  every  complex  isotropic  SaS  random  variable  with  a  <.  2  can  be  expressed  as 

X  =  A^^^{Gi  +  JG2),  (2.23) 


and  its  generation  involves  the  generation  of  a  real,  totally  skewed  stable  random  variable.  The 
problem  of  generating  a  real  stable  deviate  is  studied  in  [16]  and  [11].  Here,  we  present  the  result 
for  easy  reference. 

To  generate  a  real  standard  stable  random  variable  A  ~  5a(l,/3)  of  characteristic  exponent 
O',  skewness  /?  and  unit  dispersion  7=1,  the  following  representations  can  be  deduced: 


^  sma{U-Uo)  fcos{U-a{U-Uo)) 

S(a:,/3, 1)  = 


for  a  ^  1, 


(2.24) 


and 


5(1,/3,1)  =  ^ 


+  /3f/)  tanC7-/31n( 


fWcosU 


(2.25) 


where  W  is  standard  exponential  with  Pr{W  >  w}  =  e““',  tu  >  0,  and  U  is  uniform  on  (-|,  f). 
Also,  Da, 13  =  [cos(arctan(/3tan(7ra/2)))]-^/°',  and  Uq  =  -^I3[k{a)/a]  with  A:(q;)  =  1  -  |1  -  a;|. 
Then,  a  stable  variate,  Ai,  of  dispersion  7  can  be  obtained  from  A  by  Ai  =  7^/“A. 


2.2  Parameter  Estimation  with  Fractional  Lower-Order  Mo¬ 
ments 

The  moment  theory  for  SaS  processes  is  based  on  fractional  lower-order  statistics.  The  fractional 
lower-order  moment  of  a  SaS  random  variable  X  satisfies: 

E(|A|P)  =  Ci(p,  a)7»’/“,  for  -  1  <  p  <  a,  (2.26) 

where  C'i(p,  a)  =  characteristic  exponent  (0  <  a  <  2),  7  is  the  dispersion 

and  r(-)  is  the  Gamma  function.  When  X  is  an  n  dimensional  spherically  symmetric  SaS  random 
variable,  E(|X|p)  is  given  in  [5]: 

E(|X|P)  =  -n<p<a.  (2.27) 

Specifically,  when  X  is  an  isotropic  bivariate  SaS  random  variable,  we  have: 

E(|XP)  =  C2ip,  a)7P/“,  for  -  2  <  p  <  a,  (2.28) 

where  C2(p,  a)  =  2^^^^  The  simplest  yet  most  important  class  of  complex  SaS  random 

variables  is  the  class  of  isotropic  bivariate  SaS. 

Notice  that  the  value  of  p  (the  order  of  moment)  can  be  both  negative  and  positive.  Using 
this  property,  we  can  estimate  the  parameters  (a,  7)  of  SaS  processes  [17]. 
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•  Sine  Function  Estimator: 

Letting  0  <  p  <  min  (a,  1),  such  that  both  positive-  and  negative-order  moments  are  finite, 
then: 

E(|A-|>’)E(|X|-!')  =  (2.29) 

a  sm[pK/0!) 

i.e.,  a  can  be  found  by  solving  the  following  sine  function  equation: 


sinc( — ) 
a 


sin(^)  _  2tan(p7r/2) 


0  <  p  <  min(a,  1). 


(2.30) 


The  above  equation  does  not  involve  7.  Once  a  is  estimated,  7  can  obtained  from  (2.26): 


^  \Ci(p,a)J 


(2.31) 


•  log  |SaS|  Estimator: 

By  rewriting  EdATl^)  as  E(eP'°sl'^l)  and  by  defining  the  new  random  variable  Y  =  log  \X\* 
we  have  that: 

E(|X|P)  =  E(eP‘°sl^l)  =  E(eP^),  -1  <  p  <  a,  (2.32) 

where  E(eP^)  is  the  moment-generating  function  of  Y.  By  expanding  E(eP^)  into  power 
series: 

CO  k 

E(eP^)  =  £E(y'')^,  (2.33) 

k=0 

and  by  using  with  (2.26),  we  can  see  that  the  moments  of  Y  of  any  order  must  be  finite 
and  they  satisfy: 

E(y '=)  =  ^  (Ci  (p,  a)7"/“)  |p=o.  (2.34) 

Simplifying  the  above  equation,  we  have  [18]: 

E(y)  =  Ce(^-l)  +  ilog7,  (2.35) 


where  Cg  =  0.57721566  •  ■  •  is  the  Euler  constant,  a  is  the  characteristic  exponent,  7  is  the 
dispersion,  and 


Var(y)  =  E{(y  -  E{y})2}  =  f  ^  +  1 


E{(y-E{y})"}  =  2C(3)  -3-1 


q;'= 


(2.36) 

(2.37) 


where  C(‘)  is  the  Riemann  Zeta  function,  and  C(3)  is  a  constant:  (:’(3)  =  1.2020569  ••  •. 
The  higher-order  moments  of  Y  always  exist  and  starting  from  the  second-order  moment, 
the  equations  only  involve  o;.  This  property  provides  a  simple  algorithm  to  estimate  the 
parameters  (o;,  7)  of  a  SaS  random  process.  Since  we  can  estimate  the  mean  and  variance 
of  y  by: 

EfciX'  ,2  _E2,i(U-JT! 


y  = 


N 


ay  = 


N -1 


(2.38) 


‘Note  that  log  |X|  is  bounded  because  the  p.d.f.  of  X  f{x)  is  bounded  at  2;  =  0. 
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Table  2.1:  Performance  comparison  of  the  log  |Sq;S|  Estimator  versus  the  Sine  Function  Estimator. 
The  sample  size  is  5000,  the  true  values  are  a  =  1.5  and  7=1. 


Estimation  Method 

a 

7 

log]  SaS  1  Estimator 

1.4969 

(0.0522) 

0.9989 

(0.0385) 

Sine  Function  Estimator 

1.5027 

(0.0536) 

1.0023 

(0.0423) 

where  N  is  the  number  of  i.i.d.  samples  of  {i;},  by  solving  (2.36)  we  can  obtain  an 
estimate  of  a  and  substitute  into  (2.35)  for  an  estimate  of  7.  The  log  |Sa:S|  process  was 
first  introduced  by  Zolotarev  [5]. 

Table  2.1  shows  the  computer  simulation  results  for  the  above  two  estimators.  As  we  can  see,  both 
estimators  have  mean  value  close  to  the  true  one  and  low  variances  (numbers  in  parentheses.) 

2.2.1  Mountaintop  Data  Analysis:  Clutter  Modeling  Using  SaS  Theory 

The  data  files  used  in  our  experiments  are  from  the  Mountaintop  Data  Package  tape.  The  clutter 
data  set  contains  measured  monostatic  clutter  using  the  Radar  Surveillance  Technology  Exper¬ 
imental  Radar  (RSTER).  The  data  consist  of  four  files:  cm435al.bfr,  cm435bl.bfr,  cm435cl.bfr, 
and  ncal435a.bfr.  These  data  files  were  collected  on  August  27,  1993  as  part  of  the  Clutter  Map 
Multi-Frequency  Experiment.  The  data  are  unequalized  I/Q  data  and  the  waveform  employed 
was  a  5  fiSec  pulsed  CW  signal  at  435  MHz.  file  cm435al  covers  from  3.5  to  43.5  nmi  in  range. 
File  cm435bl  covers  from  40  to  80  nmi  in  range.  File  cm435cl  covers  from  75  to  115  nmi  in  range. 
Each  range  has  a  sample  size  of  495.  All  of  these  files  have  enough  CPIs  to  cover  360  degrees  in 
azimuth.  The  file  ncal435a  contains  the  system  noise.  In  our  comprehensive  study,  we  conduct 
our  experiments  in  24  different  azimuth  angles  with  15  degrees  interval.  At  each  azimuth  angle 
of  interest,  we  concatenate  the  data  from  the  three  ranges  to  form  a  single  long  stream  (1485 
samples).  Experiments  with  the  actually  raw  data  are  performed  with  the  I/Q  clutter  data  on 
Sensor  #  1  (There  are  14  sensors  in  the  linear  array).  For  the  SaS  modeling,  the  parameters 
(characteristic  exponent  a  and  dispersion  7)  are  first  estimated  using  fractional  lower  order  mo¬ 
ments,  then  estimated  by  amplitude  probability  density  (APD)  curve  fitting  using  exhaustive 
search.  Noting  that  these  parameters  should  remain  the  same  for  both  the  I-  and  Q-components 
of  the  clutter  data,  in  the  analysis  of  APD  we  focus  on  the  I-component  for  simplicity.  In  the 
Gaussian  model,  the  variance  is  estimated  by  taking  standard  deviation  of  the  data.  The  samples 
are  assumed  to  be  i.i.d.  and  for  convenience,the  data  have  been  scaled  by  10®.  In  Figures  2.4  to 
2.17,  we  show  the  I/Q  data  time  series,  estimated  parameters,  and  APD  comparison  along  several 
azimuth  angle  values  at  15,  45,  60,  105,  120,  240,  and  315  degrees.  The  experiments  demonstrate 
that  the  SaS  distribution  is  superior  to  the  traditional  Gaussian  distribution  for  modeling  the 
actual  radar  clutter  data. 
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2.3  Signal  Detection  in  Sub-Gaussian  Impulsive  Interference 

In  this  section,  we  address  the  problem  of  coherent  detection  of  a  signal  embedded  in  heavy¬ 
tailed  noise  modeled  as  a  sub-Gaussian,  alpha-stable  process.  We  assume  that  the  signal  is  a 
complex- valued  vector  of  length  L,  known  only  within  a  multiplicative  constant.  The  dependence 
structure  of  the  noise,  i.e.,  the  underlying  matrix  of  the  sub-Gaussian  process,  is  not  known.  The 
intent  is  to  implement  a  generalized  likelihood  ratio  detector  which  employs  robust  estimates 
of  the  unknown  noise  underlying  matrix  and  the  unknown  signal  strength.  The  performance 
of  the  proposed  adaptive  detector  is  compared  to  that  of  an  adaptive  matched  filter  that  uses 
Gaussian  estimates  of  the  noise  underlying  matrix  and  the  signal  strength  and  is  found  to  be 
clearly  superior.  The  proposed  new  algorithms  are  theoretically  analyzed  and  illustrated  in  a 
Monte-Carlo  simulation. 

2.3.1  Estimation  of  the  Underlying  Matrix  of  a  Sub-Gaussian  Vector 

Before  we  attack  the  signal  detection  problem  in  sub-Gaussian  interference,  we  present  some 
necessary  theory  in  the  estimation  of  the  underlying  matrix  of  a  sub-Gaussian  vector.  The 
following  proposition  found  in  [10,  pp.  89]  expresses  the  underlying  matrix  of  a  sub-Gaussian 
vector  in  terms  of  its  covariation  matrix  and  can,  therefore,  be  used  to  obtain  high  quality 
estimates  of  the  underlying  matrix  of  the  vector  from  independent  observations. 

Proposition  2.2  Let  2L=  V2, . . . ,  be  a  sub-Gaussian  random  vector  with  underlying 

covariance  function  R  (cf.  (2.6)).  Then,  its  covariation  matrix  Q_  will  consist  of  the  elements 

Cij  =  [X„  X,-],  -  (2.39) 

The  usefulness  of  the  proposition  lies  in  finding  consistent  estimators  of  the  underlying  co- 
variance  matrix  of  a  sub-Gaussian  vector  from  independent  realizations  X^,  . . .  ,X^  of  the 

vector.  The  elements  Cij  of  the  covariation  matrix  Q_  can  be  estimated  by  combining  moment 
approximations  of  the  expressions  (2.14),  (2.15),  and  (2.16)  as  follows 

Cij  =  C{p,  a)[]-  (2.40) 

kz=l  k^l 

where  any  p  <  f  will  result  in  a  consistent  estimate"^  [19],  as  shown  in  the  following  proposition. 

Proposition  2.3  The  estimator  Cij  of  the  covariation  matrix  elements  given  in  (2.40),  where 
p  <  a/2,  is  consistent  and  asymptotically  normal  with  mean  Cij  and  covariance  S{{Cij  — 
Cij){Cim  —  Cim)*}  CIS  in  Appendix  A. 

Proof  See  Appendix  A.  | 

Equation  (2.39)  can  now  be  used  to  compute  an  estimate  of  the  underlying  matrix  R  from 
the  estimate  of  the  covariation  matrix  Q_. 

^We  have  empirically  found  that  a  good  choice  is  p  =  j. 
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Proposition  2.4  Let  Cij  be  the  estimator  of  the  covariation  matrix  elements  given  in  (2.40), 
where  p  <  a/2.  The  estimates 


4-  =  2f4-/47" 


(2.41) 

(2.42) 


of  the  elements  of  the  underlying  covariance  matrix  are  consistent  and  asymptotically  normal 
with  means  Rjj  and  Rij,  respectively,  and  variances  as  in  Appendix  B. 


Proof  See  Appendix  B.  | 

The  procedure  is  illustrated  with  the  following  simulation  study:  Consider  a  sub-Gaussian 
random  vector  of  length  T  =  64  and  underlying  matrix  R  =  diag  {1, 1, . . .,  1}.  We  assume  that 
K  =  500  independent  realizations  of  the  vector  are  available  and  plot  the  32"^^  row  of  the  mean 
over  100  Monte-Carlo  simulations  of  the  following  two  estimates: 


1  ^  T 

k  =  —  V  x^x’^  (2.43) 

k=\ 

R  =  as  obtained  from  covariation  matrix  estimates  (2.42).  (2.44) 

We  examined  the  cases  of  o;  =  2  and  a  =  1.5.  Clearly,  the  Gaussian  estimate  fails  when  a  =  2, 
while  the  covariation-based  estimate  maintains  high  performance  in  both  the  cases  of  a  =  2  and 
a  =  1.5. 

Next,  we  consider  the  estimation  of  the  amplitude  of  a  signal  of  known  shape  embedded 
in  sub-Gaussian  noise  from  a  number  of  independent  observations.  The  following  Proposition 
outlines  the  procedure  and  states  its  performance. 


Proposition  2.5  Consider  the  collection  of  K  vectors  =  As  4-  ,  k  =  1,2,...,  A,  where 

^  —  1.  Form  the  least-squares  estimates  Ak  =  s'^X^  =  s^As  4-  s'^N^  =  A  +  ^ ,  k  = 


s  s 


1,2,  ...,A'.  Define  A  =  sm  {Ai,  A2, . . . ,  Ak},  where  indicates  the  sample  median 

of  its  arguments.  The  estimate  A  is  consistent  and  asymptotically  normal  with  mean  equal  to 


the  true  signal  amplitude  A  and  variance  where  7  =  2  — 

2~i  (s^ Rs)2  . 


Proof  The  random  variables  Ak,  k  =  1,  2, . . .,  A,  will  be  independent,  each  of  pdf  f{x),  which 
can  be  computed  as  follows.  We  begin  with 

Ak  =  A-^  s^K’'  =  A  + 


where  G^,  k  =  1,2,  ...,A,  are  independent  Gaussian  random  vectors,  each  of  mean  zero  and 
covariance  matrix  R.  Therefore,  s^G^,  k  =  1,2,...,K,  are  independent  Gaussian  random  vari¬ 
ables  of  mean  zero  and  variance  s^Rs,  which  implies  that  s^Lk,  k  =  1,2,..., A',  are  indepen¬ 
dent  sub-Gaussian  random  variables  of  length  L  =  I  and  dispersion  7  =  2~'^{^ R^^ .  Thus, 

f{x)  =  fail,  A;  x). 
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From  [20,  p.  369],  it  follows  that  the  sample  median  of  Ak,  A:  =  1, 2, . . . ,  A',  is  asymptotically 
(for  K  oo)  normal  with  mean  equal  to  the  true  median  (A)  and  variance  'k[2fa{-y,A;A)^‘^- 
fa{l,S\x)  =  (x  -  and  /a(l,0;0)  =  [21],  Combining  the  last  two 

relations,  we  get 

1  r  1  12  ^  1  .  2 

/f4/,(7,A;A)J  KhT{l/ay 

as  the  asymptotic  variance  of  the  estimator  A.  | 


2.3.2  Data-Adaptive  Algorithms  for  Coherent  Signal  Detection 

We  consider  the  hypothesis  testing  problem 

Ho  :  A^  =  iV^ 

k  —  1,  2, . . . ,  A 

Hi  :  X  =  S  +  Ny 

where  all  the  vectors  have  dimension  (length)  L  and  A;  =  1,  2,...,K  indexes  independent,  iden¬ 
tically  distributed  realizations. 

We  make  the  following  assumptions: 

1.  The  noise  vectors  N_^  have  a  sub-Gaussian  distribution,  i.e., 

=  wlay 

where  Wk  is  a  positive  (a/2)-stable  random  variable  of  unit  dispersion,  G  is  a  Gaussian 
random  vector  of  covariance  matrix  and  w  and  G  are  independent. 

2.  The  signal  vector  S  =  As  consists  of  a  known  shape  s  (for  which  s^s  =  1)  and  an  unknown 
amplitude  A, 

The  proposed  test  statistic  is  a  generalized  likelihood  ratio  test  that  makes  use  of  the  multi¬ 
dimensional  Cauchy  pdf  defined  in  (2.8): 


tc  =}  log - ^ - Tzi - 

1  +  {X-As)Tr  (X-As) 


(2.45) 


For  the  estimates  R  and  A,  we  choose  the  estimates  proposed  in  (2.42)  and  Proposition  2.5, 
respectively.  Assuming  Gaussian  noise  of  unknown  covariance  matrix  R  and  unknown  signal 
amplitude,  the  data- adaptive  detector  attains  the  form  of  an  adaptive  matched-filter,  i.e.,  it 
computes  the  test  statistic 

to  =  (2/ A)  ^{As)^k~^X-  lllVi  (2-46) 

A:=l 


where  A  =  (1/A')  Ylk=\  s^r-G  '  ^  ""  ^  SfLi  {2L  -  ^s)  {2L  As)'^ . 
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2.3.3  Performance  Analysis  and  Computer  Illustration 
A.  Theoretical  Performance  Analysis 

The  theoretical  performance  of  the  described  detection  algorithm  is  analyzed  by  means  of  the 
following  propositions. 


Proposition  2.6  Assume  that  the  signal  amplitude  A  and  the  noise  underlying  matrix  R  were 
known,  while  the  test  statistic 


1  + 


EMj- 


k=i 


+  {X-As)T^-\X-As) 


(2.47) 


were  used  for  detection  purposes.  The  asymptotic  (for  K  ^  oo)  receiver  operating  characteristic 
would  then  he  _ 

Pf  =  -  erfc[4^ - j= - ],  (2.48) 

2  \/2< 

where  is  the  probability  of  detection  induced  by  the  test  statistic  t^,  when  operating  at  a 

probability  of  false  alarm  Pf^.  In  (2-48),  we  have  made  use  of  the  notation  mHo  = 
mHi  =  €{t^\Hi}  <  oo,  ajj^  =  var  {t^|i7o}  <  oo,  =  var  <  o°- 


Proof  From  the  tail  behavior  of  the  sub-Gaussian  distributions,  we  can  show  that  the  above- 
defined  are  finite.  Therefore,  as  K  oo,  the  test  statistic  t^  approaches 

a  Gaussian  random  variable  under  either  hypothesis.  From  this  observation,  the  asymptotic 
performance  follows.  | 

Proposition  2.7  Let  t^  be  as  in  Proposition  2.6,  (cf.  (2.47)).  The  asymptotic  (as  K  oo) 
performance  of  the  test  statistic  tc  in  (2.45)  is  the  same  as  the  asymptotic  performance  of  the 
test  statistic  t^,  that  requires  knowledge  of  the  signal  amplitude  and  the  noise  underlying  matrix. 

Proof  From  Propositions  2.4  and  2.5,  R  and  A  are  consistent  estimators  ol R  and  A,  respectively, 
as  A  oo.  Slutsky’s  theorem  [22,  p.  461]  guarantees  that  the  asymptotic  distribution  of  tc 
will  be  the  same  as  the  asymptotic  distribution  of  t^  and  subsequently  tc  will  result  in  the  same 
asymptotic  receiver  operating  characteristic  as  in  Proposition  2.5.  | 


B.  Computer  Illustration  of  the  Detector  Performance 

The  small  sample  performance  of  both  the  Gaussian  and  the  proposed  Cauchy  detectors  can  be 
accurately  assessed  only  via  Monte-Carlo  simulation.  To  this  end,  we  chose  an  observation  vector 
of  length  L  =  8  and  K  =  10  independent  copies  of  it,  while  for  the  signal  we  chose  a  shape  of 
a  square  pulse  of  height  1/v^  and  an  amplitude  of  A  =  1.  The  sub-Gaussian  interference  was 
assumed  to  be  of  characteristic  exponent  a  =  2, 1.75, 1.5, 1.25, 1,  and  0.75,  and  underlying  matrix 
R  =  diag  {1,1,...,!}.  The  performance  of  the  Gaussian  and  the  Cauchy  detectors  was  assessed 
via  10,000  Monte-Carlo  runs. 

In  Figure  2.19,  we  compare  the  performance  of  the  Gaussian  and  the  Cauchy  detectors  for 
different  values  of  the  characteristic  exponent  a.  We  see  that,  for  a  =  2,  the  Gaussian  detector, 
as  expected,  outperforms  the  Cauchy  detector;  however,  for  all  other  values  of  a,  the  Cauchy 
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detector  maintains  a  high  performance  level,  while  the  performance  of  the  Gaussian  detector 
deteriorates  down  to  unacceptably  low  levels.  In  Figure  2.20,  we  show  the  performance  of  the 
Gaussian  and  the  Cauchy  detectors  for  different  values  of  the  characteristic  exponent  a. 


2.3.4  The  Special  Case  of  a  Single  Observation 


Consider  the  data  model 


X  -  A5  +  iV, 


(2.49) 


where  A  is  an  unknown  signal  amplitude,  5  is  a  known  signal  shape,  and  TV  is  a  sub-Gaussian 
random  noise  vector.  Let  us  for  the  moment  assume  that  the  underlying  matrix  of  the  distribution 
is  known.  Several  immediate  observations  become  apparent:  The  maximum  likelihood  estimate 
Ami  satisfies 

Ami=  argmax„/(X  -  as),  (2.50) 


where  /(•)  is  the  known  sub-Gaussian  distribution.  We  can  make  use  of  the  representation  (2.9) 
and  proceed  to  write 

roo 

Ami=  argmax^  /  fiiu)f2i2L-as\u)du,  (2.51) 

Jo 

where  /i(-)  is  the  distribution  of  the  square  root  of  a  positive,  —stable  random  variable  and 
/2(X  -  as|a)  =  /g(~~°~)  with  /g(-)  the  multivariate  Gaussian  distribution  of  covariance  matrix 
R.  Clearly: 

X  —  as 

Ami  =  argmax^/G(= - =),  (2.52) 


i.e. 


s^R-^X 

s^R~^s 


(2.53) 


We  have,  in  other  words,  shown  that  the  maximum  likelihood  estimate  Ami  is  the  least-squares 
estimate,  as  in  the  Gaussian  case;  however,  in  the  sub-Gaussian  case,  the  estimate  is  not  efficient 
and,  in  fact,  contains  infinite  error  variance.  As  a  consequence,  the  estimate  Ami  is  not  consistent 
and  the  performance  of  the  test  statistic  tc  in  which  this  estimate  is  used  will  not  approximate 
the  performance  of  the  test  statistic  t^  in  which  exact  knowledge  of  A  (and  ^)  is  assumed. 


2.3.5  Summary  and  Conclusions 

In  this  section,  we  addressed  the  problem  of  detection  of  a  signal,  known  within  a  multiplicative 
constant,  in  sub-Gaussian  impulsive  interference  of  unknown  underlying  matrix.  We  defined  the 
sub-Gaussian  class  of  random  processes,  summarized  several  key  results,  and  presented  data- 
adaptive  algorithms  for  coherent  detection  of  a  known  (within  a  multiplicative  constant)  signal. 
Our  derivations  were  based  on  a  solid  mathematical  basis. 

From  this  study,  we  found  that  the  Gaussian  detectors  for  the  same  problem  deteriorate 
in  performance  when  required  to  operate  in  sub-Gaussian  interference.  On  the  other  hand,  a 
detector  based  on  the  multidimensional  Cauchy  distribution  exhibited  resistance  to  the  presence 
of  the  sub-Gaussian  interference  and  high  performance,  comparable  to  the  performance  of  the 
Gaussian  detector  in  Gaussian  interference. 
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2.4  Target  Angle  and  Doppler  Estimation  for  Airborne  Radar 
in  Impulsive  Interference  Modeled  as  a  Stable  Process 

In  this  section  we  develop  target  angle  and  Doppler,  maximum  likelihood-based  estimation  tech¬ 
niques  from  radar  measurements  retrieved  in  the  presence  of  impulsive  noise  modeled  as  a  multi¬ 
variate  SaS  random  process.  We  derive  the  Cramer-Rao  bounds  for  the  additive  SaS  interference 
scenario  to  assess  the  best-case  estimation  accuracy  which  can  be  achieved.  In  addition,  we  in¬ 
troduce  a  new  joint  spatial-  and  doppler-frequency  high-resolution  estimation  technique  based 
on  the  fractional  lower-order  statistics  of  the  measurements  of  a  radar  array.  We  define  the  co¬ 
variation  matrix  of  the  space-time  radar  observation  vector  process  and  employ  subspace-based 
estimation  techniques  to  the  sample  covariation  matrix  resulting  in  improved  target-angle  and 
Doppler  estimates  in  the  presence  of  impulsive  interference.  The  results  are  of  great  importance 
in  the  study  of  space-time  adaptive  processing  (STAP)  for  airborne  pulse  Doppler  radar  arrays 
operating  in  impulsive  interference  environments. 

2.4*1  Maximum  Likelihood  Estimation  in  Alpha-Stable  Noise:  The  Cauchy 
Beamformer 

In  this  section,  we  develop  the  Maximum  Likelihood  (ML)  estimator  of  the  location  and  Doppler 
frequency  of  a  target  in  the  presence  of  noise  modeled  as  a  complex  isotropic  a-stable  process. 
Initially,  we  concentrate  on  the  additive  complex  Cauchy  noise  case.  There  are  two  reasons  for 
doing  this:  First,  the  Cauchy  distribution  has  a  closed-form  expression  for  its  density  function. 
This  results  in  a  straight-forward  implementation  of  the  maximum  likelihood  estimation,  with 
closed  form  expressions  for  the  Cramer-Rao  bound.  Secondly,  it  is  shown  through  simulations 
that  the  Cauchy  beamformer  is  very  robust  in  different  impulsive  noise  environments,  i.e.,  its 
performance  does  not  change  significantly  when  the  parameter  a  of  the  SaS  noise  varies  in  the 
interval  [1,2]. 

A.  Single  Target  Response  Model 

Consider  a  uniformly  spaced  linear  array  radar  antenna  consisting  of  N  elements,  which  transmits 
a  coherent  burst  of  M  pulses  at  a  constant  pulse  repetition  frequency  (PRF)  fr  and  over  a  certain 
range  of  directions  of  interest.  The  pulse  repetition  interval  is  {fr  —  l/Ty.)  A  space-time 
snapshot  refers  to  the  MW  x  1  vector  of  samples  corresponding  to  a  single  range  gate.  Given  a 
target  at  angle  0  and  Doppler  frequency  /,  the  space-time  snapshot  can  be  written  as  [23] 

x  =  /?v(</>,/)  +  n,  (2.54) 

where  /?  is  the  target’s  complex  amplitude  given  by 

P  =  x  +  jy.  (2.55) 

The  vector  v  is  an  NM  x  1  vector  called  the  space-time  steering  vector.  It  may  be  expressed  as 

v(^,/)  =  b(/)®a(<^)  (2.56) 

where  a.{(j))  is  the  x  1  spatial  steering  vector  containing  the  interelement  phase  shifts  for  a 
target  at  (j),  and  b(f)  is  the  M  x  1  temporal  steering  vector  that  contains  the  interpulse  phase 
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shifts  for  a  target  with  Doppler  /.  It  is  assumed  that  the  functional  form  of  v((j!>, /)  is  known. 
For  a  linear  radar  array  whose  elements  are  spaced  a  half- wavelength  apart,  the  following  two 
vectors  are  defined: 

•  Spatial  steering  vector:  a(V»)  =  [1,  e^'^, where  V’  =  ^  cos(<?!))  is  the  normal- 
ized  spatial  frequency  of  the  target,  and 

•  Temporal  steering  vector:  b(a;)  =  . , . ,  where  cv  =  is  the  normalized 

Doppler  frequency  of  the  target. 


The  snapshot  contains  a  noise  component  n  which  includes  clutter,  jamming,  thermal  noise, 
and  any  other  undesired  signals.  We  model  n  as  a  multivariate  Cauchy  process  with  pdf  given 
by  (cf.  (2.8)) 

where  E  is  a  positive-definite  matrix  which  models  the  statistical  dependence  of  the  impulsive 
noise  process,  and  c  =  ^(MN+i)y2r(^^+^).  As  a  first  approximation  to  the  problem,  we  will 
assume  that  the  noise  present  at  the  array  is  statistically  independent  both  along  the  array 
sensors  and  along  time.  In  this  case,  each  component  of  the  noise  vector  is  modeled  as  a  complex 
isotropic  Cauchy  process  with  marginal  pdf  given  by 


X7(^) 


7 

2n{r'^  +  72)3/2  ’ 


(2.58) 


where  7  is  the  noise  dispersion.  Under  the  independence  assumption,  it  follows  from  (2.54)  and 
(2.58)  that  the  joint  density  function  for  the  case  of  a  single  snapshot  is  given  by  [24] 


MN 


/(n)  =  n  fini)  = 

i=l 


yMN 


(2.59) 


where  is  the  ith  component  of  the  space-time  steering  vector  v('0,u;) 

The  estimation  problem  involves  four  real  valued  parameters  which  we  can  arrange  to  form  a 
4x1  parameter  vector 

©  =  [01  02  O3  04]  =  [iP  oj  X  y].  (2.60) 

Given  a  single  snapshot  x  and  ignoring  the  constant  terms,  the  likelihood  function  L(0),  is  given 

by 

o  MN 

i(®)  =  -|  E  loS  P.61) 


B.  Cramer- Rao  Bound  Analysis 

The  Cramer-Rao  bound  (CRB)  for  the  error  variance  of  an  unbiased  estimator  ©  is  given  by 

£;{(©-©)(©-©)^}>  (2-62) 

where  J  (©)  is  the  Fisher  information  matrix  given  by 

J(©)  =  E{[dL{@)/d&][dL{Q)/d@f}.  (2.63) 

Under  the  assumptions  stated  in  Section  2.4.1,  and  for  the  case  of  complex  isotropic  Cauchy 
noise,  the  following  theorem  holds: 
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Theorem  2.1  The  CRB  on  the  error  variance  of  the  target  angle  cf)  and  Doppler  frequency  f 
estimates  is  given  by 


and 


where 


and 


5iV(ll  IP  1 

“  |/?|2  (2;rd)2  ■  3^  ■cos2(<^)’ 


CRB{f) 


7^  1 

W{2TtTry 


5M(||  d°  11^  -Sl/N) 


N 


d“  =  5a/5<^,  \\A^\\^=Y,\dai/dct>\\ 

^  =  l 
M 

d'’  =  dh/df,  II  d**  ||^=  ^ \dbi/duj\^, 


?  =  1 


N 

E 

2  =  1 


M 

i; 

2=1 


MN 


«.  =  Eki.  =  (>  =  EM;(i)ii‘i/(i)i. 


2=1 


^  =  (M  II  d“  IP  -^5l){N  \\  d'  IP  -^Sl)  -  {SJk  -  pf. 


(2.64) 

(2.65) 


Proof  See  Appendix  C.  | 

We  should  note  that  the  above  bounds  can  be  achieved  only  when  there  exist  unbiased  esti¬ 
mators  of  all  the  model  parameters.  A  useful  insight  on  the  CRB  can  be  gained  if  we  consider 
the  case  of  a  single  source  impinging  in  a  linear  array  whose  sensors  are  spaced  a  half-wavelength 
apart.  In  this  case, 


CRB{^)  = 


7^  a2  20  1 

|/?|2  (27rd)2  ■  -  1)  ■  cos2(^)’ 


(2.66) 


and 

'v2  1  20 

CRB{f)  =  |^|2  (27rr,)2  ■  -  1)' 

The  term  7^/|/?P  in  the  above  expressions  for  the  CRB  can  be  viewed  as  the  inverse  of  a  quantity 
analogous  to  the  signal-to-noise  ratio  (SNR)  for  the  Gaussian  case,  i.e.,  a  generalized  SNR,  so  to 
speak.  The  larger  the  dispersion  7  of  the  noise,  the  higher  the  CRB. 


C.  A  Performance  Study 

To  demonstrate  the  performance  of  the  proposed  method  for  the  target  parameter  estimation 
problem,  we  conducted  several  simulation  experiments  where  we  compared  the  ML  estimator 
based  on  the  Cauchy  noise  assumption  (MLC)  with  the  ML  estimator  based  on  the  Gaussian 
noise  assumption  (MLG),  and  with  the  MUSIC  estimator. 

In  all  the  experiments  the  array  is  linear  with  five  sensors  spaced  a  half- wavelength  apart.  A 
single  signal  impinges  to  the  array  from  a  stationary  target  located  at  a  direction  of  5°.  At  first, 
the  signal  is  assumed  to  be  known  at  the  receiver  and  the  ML  method  is  applied  to  estimate  the 
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Table  2.2:  GSNR  and  average  PSNR  for  different  values  of  M. 


Number  of  snapshots^  M 

M  =  5 

M  =  10 

M  =  100 

GSNR  [dB] 

22.7416 

22.5003 

22.6543 

PSNR  [dB] 

7.3712 

3.3938 

0.5371 

-4.3158 

-7.149 

direction  of  arrival.  Also,  for  the  unknown  signal  case  the  pseudo  ML  method  is  applied  by  using 
a  least-squares  estimate  for  the  signal.  The  noise  is  assumed  to  follow  the  bivariate  isotropic 
stable  distribution. 

In  every  experiment  we  perform  500  Monte-Carlo  runs  and  compute  the  mean  square  error 
(MSE)  of  the  direction-of  arrival  (DOA)  estimates.  The  optimization  for  the  MLC  and  MLG 
methods  is  performed  by  a  steepest-descent  algorithm  with  variable  stepsize  selected  by  means  of 
Armijo’s  rule  [25].  Since  the  alpha-stable  family  for  a  <  2  determines  processes  with  infinite  vari¬ 
ance,  we  define  two  alternative  signal-to-noise  ratios  (SNR’s).  Namely,  we  define  the  Generalized 
SNR  (GSNR)  to  be  the  ratio  of  the  signal  power  over  the  noise  dispersion  7: 

.  M 

G5iVi2=101og(— ^|s(0P).  (2.68) 

Also,  for  finite  sample  realizations,  we  define  the  Pseudo-SNR  (PSNR)  as: 

PSNR  =  I01og(^|i|^).  (2.69) 

Et=i  1^(0  P 

In  this  example  we  study  the  estimation  accuracy  of  MLC,  MLG  and  MUSIC  as  a  function 
of  three  parameters,  namely  the  number  of  snapshots  M,  the  noise  dispersion  7,  and  the  noise 
characteristic  exponent  a. 

Number  of  snapshots  M.  In  the  first  experiment  we  study  the  influence  of  the  number 
of  snapshots  M  to  the  performance  of  the  algorithms.  The  noise  follows  the  complex  isotropic 
Cauchy  distribution  with  dispersion  7=1.  For  this  experiment,  the  GSNR  is  kept  almost 
constant  at  22.5  dB  as  shown  on  Table  2.2.  The  PSNR  is  different  in  every  Monte-Carlo  run,  so 
we  calculate  the  average  PSNR  over  the  500  Monte-Carlo  runs  (cf.  Table  2.2).  As  the  number  of 
snapshots  M  increases,  the  PSNR  decreases  because  more  and  more  impulsive  noise  samples  are 
incorporated  into  the  data. 

Figure  2.21(a)  shows  the  resulting  MSE  of  the  estimated  DOA  as  a  function  of  the  number  of 
snapshots  when  the  signal  is  known.  The  CRB  is  also  plotted.  As  expected,  the  MLC  estimator 
has  the  best  performance  since  it  is  the  optimal  estimator  for  this  type  of  noise.  Also,  the 
complete  failure  of  the  MLC  and  MUSIC  processors  for  this  type  of  impulsive  noise  is  apparent. 
Figure  2.21(b)  shows  similar  plots  for  the  case  of  the  pseudo  ML  estimators  where  we  use  a  LS 
estimate  for  the  signal.  The  MLC  estimator  has  again  the  least  MSE.  Comparing  these  curves 
with  the  analogous  curves  obtained  assuming  exact  signal  knowledge,  we  observe  a  larger  MSE 
for  the  pseudo  ML  estimates,  as  expected. 

Noise  dispersion  7.  In  the  second  experiment  we  study  the  influence  of  the  noise  dispersion 
7,  i.e.,  the  influence  of  the  GSNR  to  the  performance  of  the  methods.  Here  also,  the  noise  follows 
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Table  2.3:  GSNR  and  average  PSNR  for  different  values  of  7. 


Noise  Dispersion^  7 

7  =  0.5 

7=1 

7  =  2 

7  =  4 

7  =  6 

7  =  8 

7  =  10 

GSNR  [dB] 

25.7519 

22.7416 

19.7313 

16.7210 

14.9601 

13.7107 

12.7416 

PSNR  [dB] 

6.5577 

0.5371 

-5.4835 

-11.5041 

-15.0259 

-17.5247 

-19.4629 

Table  2.4:  GSNR  and  average  PSNR  for  different  values  of  a. 


Noise  Characteristic  Exponent,  a 

a=  1.0 

a=  1.2 

P 

11 

P 

II 

p 

II 

bo 

a  =  2.0 

GSNR  [dB] 

22.7416  (7  =  1) 

PSNR  [dB] 

0.5371 

6.2171 

10.1869  13.1035  15.3269 

20.0383 

the  bivariate  isotropic  Cauchy  distribution  with  dispersion  7.  The  number  of  snapshots  available 
to  the  algorithms  is  M  =  20.  The  GSNR  and  average  PSNR  for  this  experiment  are  shown  on 
Table  2.3. 

Figure  2.22  shows  the  resulting  MSE  of  the  estimated  DOA  as  a  function  of  the  GSNR.  Again 
the  MLC  estimate  has  the  best  performance.  As  evident  in  Figure  2.22(b),  the  performance  of 
the  MLC  using  a  LS  estimate  for  the  signal  degrades  more  rapidly  for  large  values  of  the  noise 
dispersion,  7  (low  GSNR  values). 

Characteristic  exponent  a.  The  importance  of  this  experiment  rests  in  its  study  of  the 
robustness  of  the  algorithms  in  different  noise  environments.  Of  course,  by  design,  the  MLG 
estimator  is  optimal  for  additive  Gaussian  noise  (a  =  2),  and  the  introduced  MLC  estimator  is 
optimal  for  additive  Cauchy  noise  (a  =  1).  An  important  property  of  any  processor  is  to  be  able 
to  perform  reasonably  well  in  a  wide  range  of  noise  environments  (1  <  a  <  2).  Here,  we  test 
the  performance  of  the  estimators  when  the  characteristic  exponent,  a,  of  the  noise  stable  law  is 
changing. 

Figure  2.23  shows  the  resulting  MSE  curves  as  functions  of  the  characteristic  exponent  a.  The 
number  of  snapshots  available  to  the  algorithms  is  M  =  20.  The  GSNR  is  22.7416  dB  (7  =  1) 
and  is  shown  together  with  the  average  PSNR,  on  Table  2.4.  The  CRB  as  a  function  of  a  was 
computed  as  shown  in  [24]. 

As  we  can  clearly  see,  the  Cauchy  beamformer  is  practically  insensitive  to  the  changes  of  a, 
and  for  exact  signal  knowledge  it  almost  achieves  the  CRB  for  the  whole  range  of  values  a.  On  the 
other  hand,  both  the  MLG  and  the  MUSIC  algorithms  exhibit  very  large  mean-square  estimation 
errors  for  non-Gaussian  noise  environments.  Note  that  when  a  =  2,  i.e.,  for  the  Gaussian  noise 
case,  the  MLG  method  has  the  least  MSE,  as  expected. 

The  experiment  demonstrates  that  for  values  of  a  in  [1,2],  the  ML  method  based  on  the 
Cauchy  noise  assumption  exhibits  performance  very  close  to  the  optimum.  This  observation, 
combined  with  the  fact  that  the  ML  method  based  on  the  Cauchy  assumption  has  computational 
complexity  similar  to  the  ML  method  based  on  the  Gaussian  assumption,  justify  the  importance 
of  the  Cauchy  beamformer  for  the  DOA  estimation  problem  in  practice. 
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D.  Concluding  Remarks 

We  considered  the  problem  of  target-angle  and  Doppler  estimation  with  an  airborne  radar  em¬ 
ploying  space-time  adaptive  processing.  We  derived  Cramer-Rao  bounds  on  angle  and  Doppler 
estimator  accuracy  for  the  case  of  additive  multivariate  Cauchy  interference  of  known  underlying 
matrix.  The  bounds  are  functions  of  a  generalized  SNR  function,  similarly  to  the  Gaussian  case 
where  the  bounds  are  functions  of  the  SNR.  The  proposed  method  is  based  on  the  maximum 
likelihood  estimation  technique  where  the  noise  is  modeled  as  a  complex  isotropic  SaS  process. 
The  Cauchy  beamformer  has  been  shown  to  give  better  bearing  estimates  than  the  Gaussian 
beamformer  in  a  wide  range  of  impulsive  noise  environments,  and  for  very  low  SNR  values. 

2,4.2  Subspace-Based  STAP  Using  Lower-Order  Statistics 

In  this  section,  we  present  a  new  subspace-based  method  for  joint  spatiab  and  doppler-frequency 
high-resolution  estimation  in  the  presence  of  impulsive  noise  which  can  be  modeled  as  a  complex 
symmetric  alpha-stable  (SaS)  process.  We  define  the  covariation  matrix  of  the  array  sensor  out¬ 
puts  and  show  that  eigendecomposition-based  methods,  such  as  the  MUSIC  algorithm,  can  be 
applied  to  the  sample  covariation  matrix  to  extract  the  doppler/angle  information  from  the  mea¬ 
surements.  The  improved  performance  of  the  proposed  source  localization  method  in  the  presence 
of  a  wide  range  of  impulsive  noise  environments  is  demonstrated  via  Monte  Carlo  experiments. 

A.  STAP  Problem  Formulation 

Consider  a  uniformly  spaced  linear  array  radar  antenna  consisting  of  N  elements,  which  transmits 
a  coherent  burst  of  M  pulses  at  a  constant  pulse  repetition  frequency  (PRF)  fr  and  over  a  certain 
range  of  directions  of  interest.  The  array  receives  narrow-band  signal  returns  reflected  by  q  moving 
targets  which  are  located  at  azimuth  angles  {9^]  A;  —  1, . .  .,^}  and  have  relative  velocities  with 
respect  to  the  radar  {uk]  k  =  1, . . .,  g}  corresponding  to  Doppler  frequencies  fc  =  1, . . . ,  g}. 
Since  the  signals  are  narrow-band,  the  propagation  delay  across  the  array  is  much  smaller  than  the 
reciprocal  of  the  signal  bandwidth,  and  it  follows  that,  by  using  a  complex  envelop  representation, 
the  array  output  can  be  expressed  as  [23]: 

x{t)  -  V{0,zu)s{t)  -h  n(i),  (2.70) 


where 

•  x(/:)  =  [a^i(^), . .  ^^xmn{^)V  ^he  array  output  vector  [N:  number  of  array  elements,  M: 
number  of  pulses,  t  may  refer  to  the  number  of  the  coherent  processing  intervals  (CPI’s) 
available  at  the  receiver); 

•  s(^)  =  [si{t)^ , . .  ySq{t)Y'  is  the  signal  vector  emitted  by  the  sources  as  received  at  the 
reference  sensor  1  of  the  array; 

•  V(0,  zcj)  =  [v(i?i,  rj7i), . . . ,  v(i?g,  ^Ug)]  is  the  space-time  steering  matrix  {zuk  =  ^); 

•  Space-Time  steering  vector:  \{'0k)^k)  — 

-  3i['dk)  =  [1?  . . . ,  is  the  spatial  steering  vector  {'dk  =  cos(0A:)); 

-  h{wk)  =  [1,  . . . ,  is  the  temporal  steering  vector. 
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•  n(i)  =  [ni(i), .  ..,nMN{t)Y  is  the  noise  vector. 

Assuming  the  availability  of  P  CPI’s  ii, . . . ,  tp,  the  data  can  be  expressed  as 

X  =  V(0,'c:7)S  +  N,  (2.71) 

where  X  and  N  are  the  MN  X  P  matrices 

X  =  [x(ti),...,x(tp)],  (2.72) 

N  =  [n(ti), . .  .,n(tp)],  (2.73) 

and  S  is  the  g  X  P  matrix 

S  =  [s(ti),...,s(tp)].  (2.74) 

Our  objective  is  to  jointly  estimate  the  directions-of-arrival  k  =  1, . .  and  the  Doppler 
frequencies  {fk',  k  =  1, . .  .,q}  of  the  source  targets. 


B.  The  Array  Covariation  Matrix 

We  will  assume  that  the  q  signal  waveforms  are  non-coherent,  statistically  independent,  complex 
isotropic  SaS  (1  <  o;  <  2)  random  processes  with  zero  location  parameter  and  covariation  matrix 
Ts  =  diag(7si, . .  .,7sJ.  Also,  the  noise  vector  n{t)  is  a  complex  isotropic  SaS  random  process 
with  the  same  characteristic  exponent  a  as  the  signals.  The  noise  is  assumed  to  be  independent 
of  the  signals  with  covariation  matrix  Tjv  =  7nl. 

Equation  (2.70)  can  be  written  as 

x(t)  =  w(t)  +  n(t),  (2.75) 

where  w(t)  =  V(©, -a7)s(t).  By  the  stability  property,  it  follows  that  w(t)  is  also  a  complex 
isotropic  SaS  random  vector  with  components 

u;-(t)  =  Vi(©,Tr7)s(t)  =  +  ■  ■  ■  +  Vi{§g,Wg)sq{t)  i  =  l,...,MN.  (2.76) 

Also,  it  holds  that  w{t)  is  independent  of  n(t). 

Now,  we  define  the  covariation  matrix,  Txi  of  the  observation  vector  process  x(t)  as  the 
matrix  whose  elements  are  the  covariations  [xi{t),Xj{t)]a  of  the  components  of  x(t).  We  have 
that 

[xi{t),xj{t)]a  =  [wi{t)  +  ni{t),Wj{t)  +  nj{t)]c 

—  [wi{t),  Wj{t)]a  +  [wi{t),  nj{t)]a  +  [ni{t),  'Wj{t)]a  +  [ni{t) ,  nj{t)]a-  (2.77) 

By  the  independence  assumption  of  w(t)  and  n(t)  and  by  property  P3  we  have  that 

[w^{t),nj{t)]a  =  0,  (2-78) 


and 


[ni{t),Wj{t)]a  =  0. 


(2.79) 
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Also,  by  using  (2.76)  and  properties  PI  and  P2  it  follows  that 

k=l 

k=l 

Q  9 

=  XI  ^k)[sk{t),Yl 

k=l  1=1 

9 

=  Yl^ii'^k,^k)vf°‘~^^{^k,‘^khsk, 

k=l 


(2.80) 


where  7^^  =  [sjt,  Finally,  due  to  the  noise  assumption  made  earlier,  it  holds  that 

^ ~  Tn^Zjjj  (2.81) 

where  6ij  is  the  Kronecker  delta  function.  Combining  (2.77)-(2.81)  we  obtain  the  following 
expression  for  the  covariations  of  the  sensor  measurements: 


[Xi{t),Xj{t)]a  =  ^Vi{'&k,^k)vf°‘  ^^{'&k,^k)'ysk+lnSi,j  i,  j  =  1, . . . ,  MN.  (2.82) 

k=l 


In  addition,  the  dispersion  and  covariation  coefficients  of  the  array  sensor  measurements  are  given 
respectively  by 

9 

=  X  j  =  l,-.-,MN,  (2.83) 

k=l 

and 


A 


Xi{t),Xj{t) 


ELl  ^k)vf°‘  ^^i'&k,  ^k)7s^  +  Jnki 

Efel  ^fc)l'^7sfc  +  7n 


(2.84) 


In  matrix  form,  (2.82)  gives  the  following  expression  for  the  covariation  matrix  of  the  observation 
vector: 

Tx  =  [x(0,x(t)]„  =  V(©,rs7)r5V<“-i>(0,z^)  +  7nI,  (2.85) 

where  the  (i,i)th  element  of  matrix  i:i7)  results  from  the  (i,  i)th  element  ofV(0,t:c7) 

according  to  the  operation 

[V<“-1>  (0,  =  [V(0,  v,)]<^-^>  =  I  [V(0,  zrr)]^  .  [V(0,  ZU)]*.  (2.86) 

Clearly,  when  Of  =  2,  i.e.,  for  Gaussian  distributed  signals  and  noise,  the  expression  for  the 
covariation  matrix  is  identical  to  the  well-known  expression  for  the  covariance  matrix: 

R;^  =  v(0,  -ca)SV^(0,  -ccr)  +  CT%  (2.87) 


where  S  is  the  signal  covariance  matrix. 
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When  the  amplitude  response  of  the  sensors  equals  unity,  it  follows  that 


[V<“-1>(0,  =  [V(©,  zct)]*  . ,  (2.88) 

and  thus  the  covariation  matrix  can  be  written  as 

Tx  =  V(©,  xu)TsV^{e,  -k;)  +  7nl.  (2.89) 

Also,  from  (2.83)  and  (2.84)  the  dispersion  and  covariation  coefficients  of  the  array  sensor  mea¬ 
surements  can  be  written  as 


Jxj{t)  =  j2lsk+'rn  j  =  l,...,MN, 

k=l 


(2.90) 


and 


A 


i,j  =  1,...,MA^. 


(2.91) 


ELl  ^k)Vj{'dk,  ^k)7sk  +  InSiJ 

Xi(t),Xj{t)  —  'V  -l-'V 

^k=l  l^k  '  7n 

Observing  (2.89),  we  conclude  that  standard  subspace  techniques  can  be  applied  to  the  covari¬ 
ation  or  the  covariation  coefficient  matrices  of  the  observation  vector  to  extract  the  bearing  infor¬ 
mation.  More  specifically,  it  follows  that  the  rank  of  matrix  V (0,  •ci7)r5V^(0,  zu)  is  q,  with  the 
smallest  {MN  —  q)  of  its  eigenvalues  equal  to  zero.  In  other  words,  if  we  let  pi  >  /92  >  •  •  •  >  Pmn 
denote  the  eigenvalues  of  matrix  Fx ,  then 


PqJrl  =  •••  =  PMN  =  In- 


(2.92) 


By  denoting  the  corresponding  eigenvectors  of  F^  by  the  ROC-MUSIC  spectrum  can 

be  expressed  as 

(2.93) 


Sroc-music{'&,^)  =  ^mn 


Etf+l  |v^(^,t^)Uii2 


The  locations  of  the  source  signals  are  determined  by  the  values  of  6  for  which  the  spectrum 
given  by  (2.93)  peaks. 

In  practice,  we  have  to  estimate  the  covariation  matrix  from  a  finite  number  of  array  sen¬ 
sor  measurements.  A  proposed  estimator  for  the  covariation  coefficient  is  called  the 

fractional  lower  order  (FLOM)  estimator  and  is  given  by  [26] 


A 


Xi{t),Xj(t) 


E?=l  Xi{t)xf-^>{t) 

E?=ik,'(i)l^ 


(2.94) 


for  some  0  <  p  <  ft.  We  will  refer  to  the  new  algorithm  resulting  from  the  eigendecomposition  of 
the  array  covariation  coefficient  matrix  as  the  Robust  Covariation-Based  MUSIC  or  ROC- 


MUSIC. 


C.  Simulations 

In  this  section,  we  study  the  resolution  capability  and  estimation  accuracy  of  ROC-MUSIC  versus 
MUSIC  as  functions  of  the  noise  characteristic  exponent  ft,  and  the  separation  of  the  two  sources. 
The  array  is  linear  with  five  sensors  spaced  half  wavelength  apart  [N  =  5).  The  number  of 
transmitted  pulses  is  M  =  10.  The  noise  follows  the  bivariate  isotropic  stable  distribution. 
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Table  2.5:  GSNR  and  average  PSNR  for  different  values  of  a. 


Noise  Characteristic  Exponent  a 

a  =  1.01 

a  =  1.2 

a  =  1.4  a  =  1.6 

P 

II 

bo 

a  =  2.0 

GSNR  [dB] 

22.3  (7  =  1) 

PSNR  [dB] 

-17.9690 

-7.8245 

-2.8001  -0.8380 

-0.2164 

-0.0441 

Characteristic  exponent  a.  In  this  experiment,  we  evaluate  the  performance  of  the  two 
algorithms  in  a  wide  range  of  noise  environments,  from  the  more  impulsive  {a  in  the  neighborhood 
of  1)  to  the  Gaussian  ones  {a  =  2),  The  angles  of  arrival  for  the  two  signals  are  0i  =:  —5*^  and 
$2  =  5^.  The  GSNR  is  22.3  dB  (7  =  1)  and  is  shown  together  with  the  average  PSNR  on 
Table  2.5.  The  characteristic  exponent  a  of  the  additive  noise  is  unknown  to  the  ROC-MUSIC 
algorithm.  We  use  two  values  of  the  parameter  p  in  the  estimation  of  the  covariation  matrix  (cf. 
(2.94)):  p  ==  0.8  and  p  =  0.4.  Clearly,  MUSIC  can  be  thought  as  a  special  case  of  ROC-MUSIC 
with  p  =  2. 

In  Figures  2.24-2.28,  space-time  spectral  estimates  obtained  in  five  independent  trials  are 
shown  for  the  ROC-MUSIC  and  MUSIC  algorithms.  Five  types  of  alpha  stable  noise  corre¬ 
sponding  to  three  values  of  the  characteristic  exponent  a  =  1.1,  1.3,  1.5,  1.8,  2.0  were  used. 
We  can  see  that  the  MUSIC  method  exhibits  high-resolution  performance  only  for  the  case  of 
Gaussian  additive  noise.  On  the  other  hand,  the  ROC-MUSIC  method  exhibits  better  resolution 
capabilities  for  non-Gaussian  additive  noise  environments. 

Figure  2.29  depicts  the  improved  performance  of  ROC-MUSIC  over  that  of  MUSIC  both 
in  terms  of  resolution  probability  and  MSE,  for  values  of  a  in  the  range  (1,2).  Note  that  for 
a  <  1.2,  MUSIC  does  not  resolve  the  two  sources  in  any  of  the  200  Monte  Carlo  runs.  The 
results  suggest  that  in  impulsive  noise  environments  modeled  under  the  stable  law,  it  is  more 
beneficial  to  use  the  covariation  matrix  (lower-order  moments)  instead  of  the  covariance  matrix 
(second-order  moments).  Of  course,  for  Gaussian  additive  noise  (a  =  2)  the  use  of  second-order 
moments  (p  ~  2)  gives  better  results. 

Angular  separation.  The  resolution  analysis  of  the  two  algorithms  was  studied  by  using  a 
popular  resolution  criterion  defined  by  the  following  threshold  equation: 

A(01, ^2)  =  PiOm)  -  l{P{0i)  +  P{02)}  >  0,  (2.95) 

where  9i  and  62  are  the  angles  of  arrival  of  the  two  signals,  0m  =  {0i  +  ^2)/2  is  the  mid-range 

between  them,  and  the  null  spectrum  P{0)  =  1/S{0)  is  defined  as  the  reciprocal  of  the  spatial 
spectrum  S{0)  given  in  (2.93).  The  two  signals  are  said  to  be  resolvable  if  inequality  (2.95)  holds. 
The  inequality  implies  that  the  null  spectrum  magnitude  at  the  mid-angle  should  lie  above  the 
line  segment  linking  the  two  signal  valleys,  in  order  for  the  two  sources  to  be  resolvable.  In  our 
experiments,  we  estimated  the  null  spectrum  from  a  finite  number  of  array  sensor  measurements 
and  we  determined  the  probability  of  resolution  by  averaging  the  resolution  events  over  the  200 
independent  Monte  Carlo  runs. 

Figure  2.30  illustrates  the  variation  of  the  algorithmic  performance  with  respect  to  the  spatial 
angle  separation  of  the  two  incoming  signals  for  GSNR=  22.3  dB,  PSNR(a  =  1.5)  =  —1.56  dB, 
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and  PSNR(Qf  =  1.8)  =  -0.22  dB.  As  expected,  the  resolution  capability  of  both  algorithms 
improves  with  increased  angle  separation  between  the  two  sources.  But  for  a  given  probability 
of  resolution,  the  ROC-MUSIC  algorithm  requires  a  lower  angle  separation  threshold  than  the 
MUSIC  algorithm. 
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Appendix  A:  Proof  of  Proposition  2.3 


Define 


k=l 

=  jhxfr 

k=l 

Tz  = 

k^l 

Ti  = 

k=l 


■  Ti  ■ 

-  S{Xi{Xi)<r-^>] 

The  vector 

T2 

Tz 

is  asymptotically  (as  K  — >  oo)  normal  with  mean  mx  _ 

£{\XiY] 

£{X^{Xn)<^-^>] 

Ti 

r  1 

and  correlation  matrix  Rrp  = 


r\\ 

ri2 

ri3 

^14 

^21 

r22 

r23 

7’24 

r^i 

r32 

r33 

^34 

^41 

r42 

^43 

r44 

where 


r’li  = 

ru  =  ^S{Xi{Xj)<^-^>\Xjn 

m  =  ^S{Xi{Xj)<^-^>Xm{Xn)<^-^^} 

ri4  =  ^S{Xi{Xj)<^-^>\Xnn 

^21  =  ri2 

r22  =  ^£{\Xj\^n 

r23  =  ^£{Xm{Xn)<^-^^\Xjn 

r24  =  ^SiWlX^n 

r3i  =  f’ls 
1'32  =  ^23 

r33  =  ^£{\Xm\^\Xn\^^-"} 

r34  =  j^S{XmiXn)<^-^^\Xnn 

ru  =  rl^ 
ru  =  f'24 


69 


r43  =  ^34 

r44  =  ^e{\Xnn. 


9liui,U2,U3,U4)  =  C{p,a)Ui^%2^^  ^ 

fif2(«i,«2,W3,  W4)  =  C{p,a)Us^%‘^^^~^ 


dai  dgi  ^  a^L 

and  define  g=  |S1  gl  .where 

.  dui  du2  dus  du4  . 


=  C{p,a)-{uiU2)°‘^^  ^ 
P 

=  C{p,a)i^-l)u[ut 


^  =  C(p,<.)^(«3a4)”''-‘ 


and  G  is  evaluated  at 


=  mT. 


TVron-  ^*.3  -  9i{Ti,T2,T3,T4) 

[Cran\~  [  92{TuT2,T3,T4) 
and  covariance  matrix  QJyRj  — 


will  be  asymptotically  normal  with  mean  ~ 

w  m  n 


Appendix  B:  Proof  of  Proposition  2.4 

From  the  results  in  Appendix  A  and  (2.42),  we  immediately  see  that  Rjj  is  asymptotically  normal 
with  mean  Rjj  and  variance  where  ^crcjj  is  the  asymptotic  error  variance  in 

the  estimate  Cij. 

2z:« 

Let  now /i(ui,  U2)  =  2uiU2“  •  Then: 
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Therefore: 


dh  o  ^ 

—  =  2ui{ -  • 

dU2  O' 


2— g  2  —  Q?  2— 2a  _  (J^.  -f7  /* 

4.,  =  [2c;,~,2( - )c’;,“  i^aa  . 

Appendix  C:  Derivation  of  CRB  for  Complex  Isotropic  Cauchy 
Noise 

The  log-likelihood  function  is  given  by  (2.61) 


L{@)  =  --zYl  *°S 


(2.96) 


The  derivatives  of  L{@)  with  respect  to  x,  y],  are  as  follows 

dii  +\nir  of 

O  MN  . 

^  7^  +  l^iP 


where  d“  =  dai/df.  Likewise,  we  obtain: 


du>  “  7^  +  |7iiP 


where  d-  =  dhijdu:.  We  also  have  that 


ivi  ly 


^  7^  +  l^t'P 


(2.97) 
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Now,  we  have  to  calculate  the  components  of  the  Fisher  information  matrix. 


.dij)) 


\hh  7^+kP  ■  ^7^+kp 


(2.98) 


From  the  assumption  made  in  Section  2.4.1,  and  observing  that  {|ai|  =  1,  i  =  1,  2,  •  •  •,  (A?'  —  1)} 
and  {|6,|  =  1,  i  =  1, 2,  •  •  • ,  (M  —  1)},  we  obtain: 


.drp) 


where: 


l£t  (7^+klT 


f  iMdg{i)\\ni\ cos{-(f>i3  -  -  7r/2  +  ^f))^ 

h  1  (7^  +  |n,P)2 


(j)^  =  arctan 


(j)p  =  arctan  ( — 


(2.99) 


In  addition, 


(j>l  =  arctan 


(j>^  =  arctan 


^[ai]J  * 


(2.100) 
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r  i(7^+KiT 


{cos2(-^^  -  -  n/2  +  C 


(2.101) 


Using  the  trigonometric  propriety  cos^('!?)  =  |(1  +  cos(2'!?)),  we  obtain: 


\dg{Y 

•  i=l  ^ 

O  MN 

E  Ki)f 


-(1  +  cos(-2^/3  -  2<j!>y(j-)  -  -  TT  +  2^")  I 


8M 

'  2  =  1 


In  the  same  way  we  can  find: 


H/JPEMiP. 


Now,  we  can  obtain 
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Likewise  we  obtain: 


Now: 
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15722  572 
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Likewise  we  obtain: 


^jSLaLx^j 
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Now: 
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Likewise  we  obtain: 


dLdL\  3 


(2.105) 


Finally, 


iMMi  ^  te a[»;, ] 

'  dx  dy]  7^  +  l^iP  T^  +  l’^iP  j 

MA?"  (  I  -l^ 
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=  0. 

The  Fisher  Information  matrix  will  be: 


M|/?|2  II  d“  ||2  |/3|2p  yM5a  xM5a 

|/?|2p  iVl/ini  d*’ ||2  yNSb  xN8b 


yMSa 

xMSa 


yNSt  MN  0 

xNSb  0  MN 


where  ||  d^  |r=  Eili  Idai/d-tpl^  and  ||  d(,  |F=  Ei=i  \dbi/du\^,  and: 

^a^Eld^l, 

1=1 


<^6  =  ^1^^'!- 


(2.106) 


(2.107) 

(2.108) 


/>=EK(oII4(oI- 


(2.109) 


Using  the  matrix  inversion  lemma  [27]  and  after  some  simple  matrix  multiplication  we  have 

=  ra.iioi 


SaSb  -  P 


a  112  M 


where 


f=(M  ||d“|p-f©(JV  ||d‘|p-;^4?) 


1 

(2.110) 

,-p?- 

(2.111) 

From  (2.110)  the  bounds  on  the  variances  of  the  spatial  and  temporal  frequency  estimates  are 
obtained  as 

™  =  (2.112, 


6M(||  d“  ||1 -4|/JV) 

CRB(u)  =  pp - , 


(2.113) 
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Then,  by  the  invariance  principle  for  maximum  likelihood,  we  can  find  the  bounds  tor  a.nd  /. 
If  we  let  w  =  W(0),  we  can  express  the  Cramer- Rao  bound  for  the  estimator  w  by  means  of  the 
Cramer-Rao  bound  for  0  [28].  In  fact,  we  have  the  follow  property  for  the  two  Fisher  information 


matrices: 

J(w)  =  GJ(0)G^, 


(2.114) 


where  G  =  {^dO j / dwi\ .  In  our  case  G  is  diagonal 


cos  (f)  0  0  0 

0  2'KTr  0  0 

0  0  10 

0  0  0  1 


(2.115) 


Using  (2.106),  (2.114)  and  again  the  partitioned  inverse  matrix  lemma  we  obtain  the  bounds  in 
(2.64)  and  (2.65).  I 
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l-component  of  the  clutter  data  on  Sensor#1  at  15  degrees  (scaled  by  1e+5). 


Q-component  of  the  clutter  data  on  Sensor#  1  at  15  degrees  (scaled  by  1e+5) 


500  1000 

sample  number 


1500  0 


500  1000 

sample  number 


Figure  2.4:  Measured  I/Q-components  of  radar  clutter:  Azimuth:  15°,  estimated  mean:  0.0043  + 
jO.0099,  estimated  a  =  0.7318,  7  =  0.0413. 
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Figure  2.5:  Comparison  of  the  empirical,  SaS,  and  Gaussian  amplitude  probability  distributions. 


clutter 


l-component  of  the  clutter  data  on  Sensor#1  at  45  degrees  (scaled  by  1e+5). 


Q-component  of  the  clutter  data  on  Sensor#1  at  45  degrees  (scaled  by  1e+5). 
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Figure  2.6:  Measured  I/Q-components  of  radar  clutter:  Azimuth:  45°,  estimated  mean: 
—0.0121  —  j0.0247,  estimated  a  =  0.8301,  7  =  0.0537. 
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Figure  2.7:  Comparison  of  the  empirical,  SoS,  and  Gaussian  amplitude  probability  distributions. 
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15 


l-component  of  the  clutter  data  on  Sensor#1  at  60  degrees  (scaled  by  1e+5). 


Q-component  of  the  clutter  data  on  Sensor#1  at  60  degrees  (scaled  by  1e+5). 


Figure  2.8:  Measured  I/Q-components  of  radar  clutter:  Azimuth:  60°,  estimated  mean:  0.0051  + 
j0.005,  estimated  a  =  0.6886,  7  =  0.0394. 


Figure  2.9:  Comparison  of  the  empirical  APD  with  Gaussian  and  SaS  models. 
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l-component  of  the  clutter  data  on  Sensor#1  at  105  degrees  (scaled  by  1e+5). 


Q-component  of  the  clutter  data  on  Sensor#1  at  105  degrees  (scaled  by  1e+5). 
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Figure  2.10:  Measured  I/Q-components  of  radar  clutter:  Azimuth:  105°,  estimated  mean:  0.028- 
j0.0241,  estimated  a  =  1.1045,  7  =  0.0249. 
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Figure  2.11:  Comparison  of  the  empirical  APD  with  Gaussian  and  SaS  models. 
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l-component  of  the  clutter  data  on  Sensor#1  at  120  degrees  (scaled  by  1e+5). 


Q-component  of  the  clutter  data  on  Sensor#1  at  120  degrees  (scaled  by  1e+5) 
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Figure  2.12:  Measured  I/Q-components  of  radar  clutter:  Azimuth:  120°,  estimated  mean: 
-0.0199  +  jO.0215,  estimated  a  =  1.1314,  7  =  0.0171. 
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Figure  2.13:  Comparison  of  the  empirical  APD  with  Gaussian  and  SaS  models. 
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l-component  of  the  clutter  data  on  Sensor#1  at  315  degrees  (scaled  by  1e+5). 


Q-component  of  the  clutter  data  on  Sensor#1  at  315  degrees  (scaled  by  1e+5). 


alpha  =  2,  Gaussian  estimate 


alpha  =  2,  FLOS-based  estimate 


Figure  2.18:  Illustration  of  the  performance  of  estimators  of  the  underlying  matrix  of  a  subGaus- 
sian  vector. 


Probability  of  False  Alarm  Probability  of  False  Alarm 

Figure  2.19:  Comparison  of  the  small  sample  performance  of  the  Gaussian  (dotted  line)  and  the 
Cauchy  (solid  line)  detector. 
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10"'  10°  10''  10° 
Probability  of  False  Alarm  Probability  of  False  Alarm 


Figure  2.20:  Performance  of  the  Gaussian  (left  column)  and  the  Cauchy  (right  column)  detector 
as  a  function  of  the  characteristic  exponent  a. 
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MEAN  SQUARE  ERROR 


Figure  2.21:  MSB  of  the  estimated  DOA  and  CRB  as  functions  of  the  number  of  snapshots  M. 
(a)  Exact  signal  knowledge,  (b)  Least-squares  estimate  of  the  signal. 


Figure  2.22:  MSB  of  the  estimated  DOA  and  CRB  as  functions  of  the  GSNR.  (a)  Exact  signal 
knowledge,  (b)  Least-squares  estimate  of  the  signal. 


SQUARE  ERROR 


Figure  2.23:  MSE  of  the  estimated  DOA  and  CRB  as  functions  of  the  characteristic  exponent  a. 
(a)  Exact  signal  knowledge,  (b)  Least-squares  estimate  of  the  signal. 


MUSIC  (N=5.  M=10.  TH=[-20  -40  40],  D=[-.3  -.2  .3],  a=1.8) 


MUSIC  (N=5,  M=1 0,  TH=[-20  -40  40],  D=t-.3  -.2  .3],  a=1 .8) 


ROC-MUSIC  (N=5.  M=1 0.  TH=[-20  -40  40],  D=[-.3  -.2  .3],  a=1 .8) 


ROC-MUSIC  (N=5,  M=10.  TH=[-20  ^0  40],  D=[-.3  -.2  .3],  a=1. 


(d) 


Figure  2.27:  MUSIC  (a-b)  and  ROC-MUSIC  (c-d)  angle-Doppler  spectra  (iV  =  5,  M 
©  =  [-20°,  -40°,  40°]  D  =  [-0.3,  -0.2, 0.3]).  Additive  stable  noise  {a  =  1.8,  7  =  4). 
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(a) 


Figure  2.29:  Probability  of  resolution  (a)  and  mean  square  error  (b)  as  functions  of  the  charac¬ 
teristic  exponent  cv. 
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Part  3 


Scalable  Portable  Parallel 
Algorithms  for  STAP  Applications 


Investigator:  Professor  Viktor  K.  Prasanna 

Collaborator:  Young  Won  Lim 


To  take  advantage  of  High  Performance  Computing  (HPC)  platforms  for  radar  signal  process¬ 
ing,  scalable  parallel  algorithms  and  portable  implementations  are  essential.  The  algorithmic 
techniques  must  exploit  the  HPC  architectures.  These  techniques  will  be  significantly  different 
from  earlier  fine  grain  custom  VLSI  approach  [5,  6].  In  this  part,  scalable  portable  algorithms 
on  HPC  platforms  are  developed  for  Higher-Order  Post-Doppler  (HOPD)  STAP.  This  problem 
is  computationally  challenging  because  of  the  amount  of  computations  to  be  performed  and  the 
need  for  real-time  performance. 

We  assume  a  general  model  of  High  Performance  Computing  platforms  to  develop  our  al¬ 
gorithms  and  perform  their  scalability  analysis.  Our  algorithms  are  written  in  MPI  (Message 
Passing  Interface)  and  they  can  be  easily  ported  to  various  HPC  platforms.  In  HOPD,  Doppler 
processing  (FFT  computations)  is  followed  by  solving  least  square  problems  (QR  Decomposi¬ 
tions).  These  steps  comprise  most  of  the  computation  time  in  HOPD  processing.  Hoever,  the 
data  is  accessed  in  orthogonal  directions  during  these  steps.  Instead  of  finding  a  fixed  single 
data  mapping  scheme,  we  use  two  different  data  mapping  schemes  so  that  all  the  necessary  data 
can  be  contained  within  a  single  processor  while  performing  computations  during  these  steps.  In 
other  words,  by  data  remapping  between  Doppler  processing  and  QR  decomposition,  we  derive 
scalable  algorithms  for  HOPD  STAP. 

The  QR  decomposition  step  causes  the  major  computational  bottleneck  in  HOPD  STAP.  If 
state-of-the-art  parallel  QR  decomposition  algorithms  such  as  ScaLAPACK  [7]  are  used,  then 
frequent  short  message  communications  are  inevitable  because  of  sequential  bottlenecks  in  such 
algorithms.  Also,  the  communication  time  of  the  HOPD  STAP  implementation  increases  as 
the  number  of  processors  increases  due  to  the  collective  communication  patterns  in  the  QR 
decomposition  step.  Hence,  the  efficiency  of  these  algorithms  becomes  very  low  for  small  size 
matrices  (less  than  1000  x  1000)  which  are  typically  encountered  in  HOPD  STAP.  It  should  be 
noted  that  many  QR  decompositions  are  to  be  solved  successively  in  HOPD  STAP.  Thus,  the 
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cumulative  parallelizing  overheads  of  these  algorithms  are  too  excessive  to  be  used  for  parallelizing 
HOPD  STAR 

Our  scalability  analysis  shows  that  the  communication  time  of  the  proposed  algorithm  de¬ 
creases  as  the  number  of  processors  increases.  This  is  possible  because  as  more  processors  are 
employed,  smaller  amount  of  data  is  allocated  to  each  processor.  As  a  result,  the  amount  of 
data  to  be  communicated  by  each  processor  during  data  remapping  decreases  as  the  number  of 
processors  increases.  Also,  data  remapping  eliminates  frequent  short  message  communications 
that  occurs  in  parallel  QR  decomposition  algorithms.  Our  experimental  results  show  that  ad¬ 
ditional  reduction  in  the  communication  time  of  data  remapping  is  possible  by  overlapping  the 
communications  with  the  computations  and  scheduling  the  communications. 

We  also  parallelize  weight  computation  steps  of  Element  Space  Pre-Doppler  (ESPrD)  and 
Element  Space  Post-Doppler  (ESPsD)  STAP,  which  are  the  major  computational  bottlenecks  in 
the  entire  processing.  We  develop  efficient  distribution  algorithms  when  a  small  set  of  processors 
distribute  their  data  equally  to  a  large  set  of  processors.  This  happens  frequently  in  the  gen¬ 
eral  real-time  signal  processing  applications,  because  the  number  of  sensors  is  limited  while  the 
number  of  processing  nodes  required  for  real-time  needs  is  much  greater.  Also,  when  exploiting 
task  parallelism,  computationally  intensive  tasks  must  be  distributed  among  many  processors  to 
alleviate  computational  bottleneck.  Simple  communication  scheduling  schemes  will  incur  node 
contention  during  distribution.  Thus,  we  develop  efficient  distribution  algorithms  by  using  data 
remapping  and  processor  groups,  which  will  therefore  reduce  the  eflfect  of  node  contention  signif¬ 
icantly.  Although  we  have  focused  on  two  variations  of  STAP,  the  techniques  of  communication 
scheduling  and  overlapping  computation  with  communication  are  adequately  general.  They  can 
therefore  be  applied  to  several  other  signal  processing  computations  to  yield  efficient  implemen¬ 
tations. 


3.1  Overview 

The  primary  goal  of  a  radar  system  is  to  detect/track  targets  over  a  surveillance  volume.  The 
system  operates  by  transmitting  signals,  receiving  echoed  signals  from  reflecting  objects,  and 
processing  those  signals.  Radar  signal  processing  encompasses  modulation  theory,  detection  and 
estimation  theory,  and  performance  evaluation.  With  recent  improvement  in  computing  technol¬ 
ogy,  attention  has  been  focused  on  building  next  generation  radar  systems.  Such  systems  will 
be  required  to  provide  longer  range  detection  of  increasingly  smaller  targets.  They  also  must  be 
able  to  operate  in  the  presence  of  hostile  electronic  counter  measures  under  unfavorable  environ¬ 
ments.  Thus,  these  systems  require  complicated  signal  processing  techniques  which  demand  more 
computational  power  than  traditional  techniques.  STAP  (Space-Time  Adaptive  Processing)  is 
such  technique.  STAP  refers  to  adaptive  filtering  algorithms  that  utilize  both  spatial  and  tempo¬ 
ral  information  Many  variations  exist  in  STAP  algorithms.  Each  one  has  different  performance 
characteristics  and  different  computational  complexity  [1]. 

Several  special  purpose  architectures  have  been  designed  for  radar  systems.  They  rely  heavily 
on  custom  VLSI  fine  grain  computations.  This  approach  provides  limited  flexibility  and  there¬ 
fore  requires  substantial  redesign  when  new  radar  signal  processing  algorithms  are  developed. 
Recently,  processing  nodes  interconnected  by  a  high  speed  network  become  increasingly  popular 
as  HPC  (High  Performance  Computing)  platforms.  Advances  in  networking  and  HPC  technolo¬ 
gies  lead  to  the  development  of  next  generation  radar  Systems.  Such  systems  are  promising 
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because  of  the  following  characteristics.  They  have  a  scalable  architecture,  Le,  the  network 
interface  and  communication  network  are  designed  to  be  scalable  and  allow  processing  nodes  to 
be  easily  attached.  Their  flexibility  enables  a  wide  range  of  signal  processing  applications  such 
as  radar,  sonar,  acoustic,  and  speech  processing  to  be  performed  on  the  same  platform  by  simply 
replacing  application  software.  They  can  also  eflFectively  cope  with  new  developments  in  signal 
processing  theories. 

In  such  systems,  coarse  grain  computation  is  performed  by  application  software.  The  perfor¬ 
mance  of  such  systems  therefore  greatly  depends  on  the  application  software.  It  must  be  designed 
to  be  scalable  otherwise  the  desired  speed-up  for  the  real-time  response  cannot  be  achieved  due 
to  overwhelming  parallelizing  overheads.  Also,  it  must  be  designed  to  be  portable  so  that  it 
can  be  ported  to  various  systems.  Due  to  MPI  standard  [17],  application  software  developed  on 
general  HPC  platforms  can  be  easily  ported  to  new  systems.  In  this  report,  issues  that  arise 
when  realizing  radar  signal  processing  algorithms  on  general  HPC  platforms,  will  be  addressed. 
It  is  necessary  to  develop  algorithmic  techniques  to  handle  these  issues,  because  current  compil¬ 
ers  cannot  provide  optimized  performance.  The  goal  of  the  this  research  is  providing  a  general 
framework  for  developing  high  speed,  scalable,  portable  radar  signal  processing  algorithms  on 
HPC  platforms.  Although  our  research  will  focus  on  radar  signal  processing,  it  will  contribute  to 
developing  algorithms  for  general  signal  processing  applications  as  well,  since  these  exhibit  com¬ 
mon  computational  characteristics.  We  believe  our  research  is  useful  in  designing  next  generation 
radar  systems. 

The  most  important  requirement  in  radar  systems  is  real-time  response.  Despite  advances  in 
computer  technology,  real-time  response  is  not  an  easily  achievable  task  because  of  the  need  to 
process  larger  problems  {i.e.  surveillance  volume  has  grown  rapidly).  Also,  the  signal  processing 
area  has  evolved  significantly  in  the  last  decade;  new  theories  have  been  developed  in  statistical 
modeling,  estimation  processing  area  has  evolved  significantly  in  the  last  decade;  new  theories 
have  been  developed  in  statistical  modeling,  estimation  methods,  transformation  theory,  among 
others.  These  sophisticated  signal  processing  algorithms  require  more  computation  power.  The 
computational  requirements  of  generic  radar  signal  processing  are  in  the  range  of  GFLOPS.  In 
some  STAP  applications,  processing  400  Giga  words  of  data  necessitates  12  Tera  FLOPS.  We 
believe  that  scalable  algorithms  and  portable  implementations  are  essential  to  take  advantage  of 
HPC  platforms  as  radar  systems.  Algorithmic  techniques  such  as  data  remapping  and  overlapping 
communication  with  computation  must  exploit  the  HPC  architecture,  which  are  significantly 
different  from  earlier  fine  grain  computation. 

Scalability,  flexibility,  programmability,  and  portability  of  HPC  platforms  motivate  developing 
next  generation  radar  systems  for  real-time  applications  such  as  radar  systems.  In  the  real-time 
applications,  the  guaranteed  fast  turn  around  time  is  the  most  important  requirement.  Both 
system  software  and  application  software  should  be  designed  to  meet  this  requirement. 

System  software  must  be  designed  so  that  operating  systems  and  message  passing  libraries  can 
ensure  predictable  computation  and  communication.  Unlike  traditional  real-time  systems  which 
depend  on  custom  VLSI  fine  grain  computation,  coarse  grain  computations  must  be  performed  by 
application  software.  Application  software  therefore  plays  an  important  role  in  these  systems.  It 
should  be  designed  to  be  scalable,  otherwise  excessive  parallelization  overheads  prevent  achieving 
the  desired  speed-up  for  the  real-time  response. 

In  this  report,  we  focus  on  developing  application  software  -  scalable  algorithms  for  radar 
signal  processing  on  general  HPC  platforms.  Emerging  systems  will  retain  minimal  functionality 
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of  general  HPC  platforms.  Additionally,  they  will  have  advanced  hardware  and  software  features 
e,g.  real-time  OS,  real-time  MPI,  and  intelligent  network  interface  to  perform  computations 
and  communications  more  efficiently.  Portability  will  also  be  maintained  on  those  systems.  Our 
developed  algorithms  ported  onto  those  systems  will  therefore  exhibit  more  improved  performance 
by  the  advanced  features  of  those  systems. 

In  general,  a  wide  range  of  signal  processing  applications  such  as  radar,  sonar,  acoustic, 
and  speech  processing  consist  of  a  sequence  of  tasks.  Each  task  possesses  different  computational 
characteristics.  Because  large  amount  of  data  is  to  be  successively  processed  in  real-time,  following 
issues  must  be  addressed. 

•  The  computational  requirement  of  each  task  is  different  from  that  of  other’s.  Therefore, 
particular  task  can  be  a  computational  bottleneck.  Scheduling  the  computations  to  allocate 
the  computational  resources  becomes  important. 

•  The  data  is  accessed  and  processed  in  different  ways  in  each  task.  It  is  difficult  to  find  a 
data  mapping  which  can  be  used  through  entire  sequence  of  tasks.  Even  though  such  data 
mapping  exists,  changing  data  mapping  between  successive  tasks  can  improve  the  overall 
performance.  It  is  necessary  to  develop  efficient  data  movement  algorithms. 

•  Latency  as  well  as  throughput  is  important.  Utilizing  various  types  of  parallelism  in  coop¬ 
eration  with  data  mapping  is  important. 

•  Latency  as  well  as  throughput  is  important.  Utilizing  various  types  of  parallelism  in  coop¬ 
eration  with  data  mapping  is  important.  Combining  task  and  data  parallelism  is  a  possible 
solution  for  small  latency  and  high  throughput.  MIMD  coarse  grain  computation  is  there¬ 
fore  more  suitable  than  SIMD  or  SIMP. 

Current  compiler  technology  cannot  be  used  in  real-time  system  because  of  following  reasons. 
Finding  optimal  data  mapping  is  known  to  be  an  N P-complete  problem.  Compilers  can  offer  only 
limited  way  of  data  mapping  e.p.  block,  cyclic,  and  block-cyclic.  Loop  parallelization  involves 
large  amount  of  unnecessary  data  movement  between  processors  and  increases  communication 
time.  Thus,  in  the  this  research,  we  will  develop  algorithmic  techniques  to  handle  previously 
mentioned  issues.  It  will  provide  more  optimized  performance  than  what  compilers  can.  We 
briefly  list  our  approaches  as  follows. 

Data  Remapping 

In  the  radar  signal  processing,  the  input  data  cube  is  accessed  in  many  ways  as  the  process¬ 
ing  proceeds.  Thus,  the  design  of  initial  data  mapping  as  well  as  data  remapping  between 
tasks  is  needed  to  minimize  the  communication  time.  The  data  mapping  problem  in  gen¬ 
eral  terms  is  to  design  a  distribution  of  data  among  the  processors  such  that  the  overall 
communication  time  is  minimized.  Data  remapping  rearranges  the  data  mapping  between 
the  tasks  of  the  application  problem  to  reduce  the  communication  time. 

Overlapping  Communication  with  Computation 

Current  network  interfaces  and  switches  offer  extremely  low  hardware  latencies;  however, 
the  software  overheads  can  be  in  the  range  of  tens  of  micro  seconds.  It  is  necessary  to 
exploit  latency  hiding  techniques  in  the  algorithm  design  phase.  We  will  exploit  overlapping 
computation  with  communication  to  hide  the  overheads  in  performing  the  communication, 
in  order  to  minimize  the  communication  time  in  data  remapping. 
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Mapping  and  Scheduling  of  Computation  and  Communication 
Most  signal  processing  problems  have  regular  and  known  computation  and  communication 
characteristics.  Exploiting  the  regular  and  known  computational  characteristics  of  signal 
processing  algorithms  on  HPC  platforms  is  needed  to  obtain  real  time  performance.  Be¬ 
cause  the  computing  power  of  the  processors  in  HPC  platforms  are  high,  the  mapping 
and  scheduling  should  exploit  coarse  grain  computations.  To  alleviate  high  communication 
overheads,  communication  should  be  carefully  scheduled. 

To  alleviate  high  communication  overheads,  communication  should  be  carefully  scheduled. 

In  summary,  we  believe  that  new  mapping  and  scheduling  algorithms  are  needed  to  exploit 
HPC  platforms  for  signal  processing  applications.  These  algorithms  must  take  into  account 
the  granularity  of  the  computations,  large  overheads  in  performing  communication,  and  the 
capability  to  overlap  computation  with  communication.  The  goal  of  the  this  research  is  developing 
a  general  framework  to  develop  high  speed,  scalable,  portable  signal  processing  algorithms  on 
HPC  platforms  to  support  real-time  needs. 

In  most  HPC  systems,  lack  of  realistic  computational  models  that  capture  the  architectural 
features  from  a  programmer’s  perspective  is  a  major  bottleneck  in  designing  and  analyzing  high 
performance  systems.  Lack  of  analytical  models  severely  restricts  the  performance  of  algorithms, 
even  though  they  are  portable.  Our  parallel  algorithms  will  be  developed  based  on  our  analytic 
model  and  will  consider  the  overheads  in  using  HPC  platforms.  The  features  of  the  model  will 
be  exploited  in  algorithm  design.  We  will  use  a  well  known  definition  of  scalability.  A  parallel 
algorithm  is  considered  scalable  if  the  execution  time  of  the  algorithm  on  a  machine  with  P 
processors  varies  as  1/P.  Many  HPC  platforms  are  under  development.  Thus,  architecture 
independent  algorithms  and  portability  of  the  developed  code  is  very  important.  Specification  of 
our  algorithms  using  standards  such  as  MPI  will  ensure  portability  of  codes. 

3.1.1  Modeling  High  Performance  Computing  Platforms 

Most  of  the  modern  HPC  platforms  are  coarse  grained  parallel  processing  systems,  having  a 
similar  high  level  architecture.  The  architecture  consists  of  three  main  components  as  shown 
in  Figure  3.1:  (a)  Powerful  computing  nodes^  often  based  on  the  same  processors  as  those  used 
in  current-generation  uniprocessor  workstations,  (b)  A  low  latency  high  bandwidth  network  that 
interconnects  the  computing  nodes.  The  network  typically  consists  of  high  speed  point-to-point 
links  and  routing  switches,  (c)  Network  interfaces  couple  each  processing  node  to  the  network 
links. 

Current  commercial  HPC  platforms  offer  hundreds  of  Giga  FLOPS  of  computing  perfor¬ 
mance.  Examples  of  such  systems  are  the  TMC  CM-5,  IBM  SP-2,  and  Cray  T3D,  among  others. 
These  machines  have  been  successfully  used  for  a  variety  of  high  performance  applications  with 
good  results.  The  chosen  model  of  parallel  computation  should  accurately  depict  the  features 
of  the  machines,  including  the  hardware  features  of  computation  and  communication  units,  the 
method  used  for  routing,  the  technique  employed  for  handling  congestion,  and  mechanisms  for 
synchronization.  In  these  machines,  the  overheads  of  performing  communication  are  still  high 
compared  with  the  speed  of  processors.  These  machines  are  therefore  suitable  for  coarse  grain 
computations. 

For  our  analysis,  we  will  model  general  HPC  platforms  as  a  set  of  high-performance  SISD 
machines  interacting  through  a  high-speed  network.  Interprocessor  communication  is  performed 
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Figure  3.1:  Typical  HPC  platform 


using  explicit  message  passing.  Let  denote  the  startup  time  for  sending  a  message.  Let 
denote  the  transmission  rate  (seconds  per  unit  of  data)  for  data  communication.  The  startup 
time,  including  operating  system  and  communication  protocol  processing  overheads,  is  associated 
with  each  communication  step.  In  current  generation  of  HPC  platforms,  the  network  (hardware) 
latency  is  very  small  as  compared  with  the  software  overheads  in  message  passing. 

We  make  the  following  assumptions  for  our  analysis:  (1)  Sending  a  message  containing  m  units 
of  data  from  a  processor  to  another  processor  or  exchanging  a  message  of  size  m  between  a  pair 
of  processors  takes  Td  +  rnTj,  time.  (2)  Suppose  each  processor  has  m  units  of  data  to  be  routed  to 
a  single  destination  and  the  set  of  all  destinations  is  a  permutation,  then  the  data  can  be  routed 
in  Td  +  rriTd  time.  (3)  To  perform  a  broadcast  of  a  unit  of  data,  a  barrier  synchronization,  or  a 
cooperative  operation  (sum,  min,  max,  prefix  sum  etc.),  2{\ogp)Td  time  is  required.  Broadcasting 
m  units  of  data  takes  \ogp{Td-\-  TUTd)  time.  (4)  To  perform  a  floating  point  operation,  it  takes 
time. 

3.1.2  Scalability 

A  sequential  algorithm  is  usually  evaluated  in  terms  of  its  execution  time,  expressed  as  a  function 
of  the  size  of  its  input.  The  execution  time  of  a  parallel  algorithm  depends  not  only  on  input 
size  but  also  on  the  architecture  of  the  parallel  computers  and  the  number  of  processors.  A 
parallel  system  is  the  combination  of  an  algorithm  and  the  parallel  architecture  on  which  it  is 
implemented.  The  scalability  is  a  measure  of  its  ability  to  achieve  performance  proportional  to 
the  number  of  processors. 

In  the  ideal  case,  the  execution  time  of  a  scalable  parallel  system  should  linearly  decrease 
with  an  increasing  number  of  processors  employed.  However,  in  practice  this  is  not  the  case, 
since  there  is  a  parallelizing  overhead  such  as  communication  time  and  idle  time.  The  purpose 
of  scalability  analysis  is  to  predict  and  measure  the  performance  of  a  parallel  system.  Several 
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metrics  have  been  proposed  for  measuring  the  scalability  of  parallel  systems. 

The  fixed  problem  analysis  allows  us  to  answer  the  question  such  as,  “What  is  the  fastest 
I  can  solve  problem  A  on  computer  X?”  and  “What  is  the  greatest  number  of  processors  I 
can  utilize  if  I  want  to  maintain  an  eflSciency  of  50  percent?”.  Large  parallel  computers  are 
frequently  used  not  only  to  solve  the  fixed  size  problems  faster,  but  also  to  solve  larger  problems. 
This  encourages  a  different  approach  to  the  analysis  of  algorithms  called  scaled  problem  analysis, 
whereby  we  consider  not  how  efficiency  varies  with  the  number  of  processors,  but  how  the  amount 
of  computation  performed  must  scale  with  the  number  of  processors  to  keep  efficiency  constant. 

3.1.3  Portability  and  MPI 

The  message  passing  paradigm  is  the  most  generally  applicable  and  efficient  programming  model 
for  parallel  machines  with  distributed  memory.  It  has  been  used  widely  in  parallel  and  distributed 
computing  systems  for  some  years.  Message  passing  is  attractive  for  portability  and  performance 
since  shared  memory  architectures  can  effectively  execute  message  passing  programs  whereas  the 
reverse  is  not  generally  the  case.  However,  the  development  of  parallel  algorithms  for  applications 
has  been  hindered  by  the  absence  of  a  standard  message  passing  interface.  The  portability  of 
library  routines  is  becoming  increasingly  important  as  codes  are  required  to  run  on  several  vari¬ 
eties  of  parallel  computers.  Portable  libraries  must  also  allow  for  portability  across  programming 
languages.  In  order  to  obtain  maximal  reuse,  the  same  libraries  must  be  usable  from  several  lan¬ 
guages.  Standardization  effort  were  begun  in  1992,  to  define  a  message  passing  interface  which 
would  be  efficiently  implemented  on  a  wide  range  of  parallel  and  distributed  computing  systems. 
The  main  objectives  of  a  message  passing  standard  are  portability  and  ease-of-use.  Also  the 
standard  will  provide  vendors  with  a  clearly  defined  set  of  routines  that  they  could  implement 
efficiently  at  a  low  level,  or  even  provide  opportunities  for  developing  eflScient  hardware.  It  not 
only  provides  portability  and  ease-of-use,  but  also  addresses  to  a  limited  extent  the  issues  of 
program  correctness  and  performance,  because  by  providing  high-level  routines  and  abstractions, 
a  standard  can  reduce  the  likelihood  of  programming  errors,  thereby  enhancing  program  cor¬ 
rectness.  MPI  also  allows  to  develop  parallel  application  libraries.  The  computing  platforms  for 
MPI  include  homogeneous  and  heterogeneous  parallel  and  distributed  systems  and  even  shared 
memory  systems. 

The  basic  content  of  MPI  is  point-to-point  communication  between  pairs  of  processes  and 
collective  communication  within  groups  of  processes.  MPI  also  contains  more  advanced  mes¬ 
sage  passing  features  which  allow  the  user  to  manipulate  process  groups,  provide  topological 
structure  for  process  groups,  and  support  the  development  and  utilization  of  parallel  libraries. 
Every  message  whether  in  point-to-point  or  collective  communication  has  an  associated  data 
type.  The  primitive  data  types  of  the  host  language,  for  example  INTEGER  and  REAL  in  Fortran, 
are  supported.  MPI  also  provides  very  general  facilities  which  can  be  used  to  describe  “derived” 
data  types.  The  data  types  of  MPI  provide  all  the  information  required  for  data  conversion  in 
heterogeneous  environments,  and  do  not  preclude  efficient  implementation  of  MPI  in  homoge¬ 
neous  environments.  They  also  allow  the  users  to  send  and  receive  messages  with  complicated 
storage  patterns  without  the  need  to  copy  data  in  to  and  out  of  message  buffers,  and  allow  an 
implementation  to  optimize  communications  with  such  storage  patterns. 

Point-to-point  and  collective  communications  within  MPI  are  performed  within  process  groups. 
MPI  defines  a  group  as  an  ordered  set  of  process  identifiers,  each  of  which  is  assigned  a  numerical 
rank  within  the  group,  between  zero  and  the  size  of  the  group.  Communications  within  MPI 
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are  also  perforined  within  a  communication  context  which  insulates  messages  in  different  parts 
of  the  program  from  one  another.  The  defining  property  of  a  context  is  that  a  message  sent 
in  one  context  can  only  be  received  in  that  same  context.  The  communication  context  is  the 
primary  mechanism  for  isolation  of  messages  in  different  libraries  and  the  user  program  from 
one  another.  Process  groups  are  user  level  objects  in  MPI  but  communication  context  are  not 
directly  visible.  MPI  bundles  the  process  group  and  communication  context  concepts  into  a 
user  level  object  called  a  communicator  which  provides  communication  services  within  a  unique 
scope.  MPI  defines  an  initial  communicator  MPLCOMM-WORLD  which  has  a  group  containing 
all  processes  of  the  program  and  a  unique  communication  context.  MPI  provides  routines  which 
allow  the  user  to  dynamically  create  new  communicators.  MPI-COMMJDUP  creates  a  duplicate 
of  an  existing  communicator,  i.e.  a  new  communicator  with  the  same  group  of  processes  and  a 
different  communication  context.  This  routine  is  key  to  the  construction  of  robust  communicative 
parallel  libraries.  MPLCOMM.SPLIT  creates  one  or  more  new  communicators  which  contain 
distinct  subgroups  of  an  existing  communicator  and  of  course  a  different  context.  This  routine  is 
key  to  clear  expression  of  task  and  control  parallel  programs.  In  the  simplest  use  of  MPI,  which 
corresponds  to  a  number  of  current  communication  libraries,  MPLCOMM-WORLD  is  the  only 
communicator  used  in  the  program. 

The  point-to-point  message-passing  routines  form  the  core  of  the  MPI  standard,  the  basic 
operations  being  send  and  receive.  They  allow  messages  to  be  sent  between  pairs  of  processes, 
with  message  selectivity  based  explicitly  on  message  tag  and  source  process,  and  implicitly  on 
communication  context.  Each  process  can  execute  its  own  code,  in  MIMD  style,  and  can  be 
sequential  or  multi-threaded.  There  is  no  explicit  support  for  threads,  but  care  has  been  taken 
to  make  MPI  “thread  safe”,  by  avoiding  the  use  of  global  state.  The  send  receive  primitives 
are  provided  in  a  blocking  form  in  which  the  sender  buffer  can  be  reused  immediately  on  return 
from  send  and  the  receiver  buffer  primitive  MPI-RECV.  There  are  four  blocking  send  primitive 
corresponding  to  the  four  communication  modes  in  MPI. 

Standard  The  sender  is  blocked  until  the  send  buffer  can  be  reused  without  altering  the 
message.  The  receiver  is  blocked  until  the  message  has  been  copied  into  the  receive  buffer. 
Since  the  system  is  expected  to  copy  the  message  subject  to  buffer  resources,  the  send- 
recv  pair  does  not  guarantee  synchronization.  This  mode  seems  to  best  represent  common 
practice.  The  send  primitive  is  MPI-SEND. 

Synchronous  The  sender  is  blocked  until  the  receiver  issues  the  corresponding  receive.  No 
system  buffer  is  required  and  the  message  can  be  transferred  without  intermediate  copies, 
at  the  expense  of  synchronization.  The  send  primitive  is  MPLSSEND. 

Ready-receive  The  program  is  in  error  if  the  send  is  issued  before  the  matching  receive 
has  been  issued.  This  allows  a  simple  protocol  where  the  message  is  sent  in  hope  and 
dropped  if  there  is  no  ready  receive,  but  use  demands  special  care.  The  send  primitive  is 
MPLRSEND. 

Buffered  This  mode  allows  the  user  to  control  the  space  available  for  buffering  within  a 
defined  buffer  model,  providing  guaranteed  portability  for  programs  that  demand  message 
buffering.  The  send  primitive  blocks  until  the  message  is  copied  into  the  buffer  space  or  is 
in  error  if  insufficient  buffer  space  was  available.  The  send  primitive  is  MPI-BSEND. 

MPI  also  provides  primitives  of  the  non-blocking,  or  immediate  return,  form,  MPU?SEND  and 
MPIJ7RECV,  in  which  the  message  buffer  must  not  be  used  until  the  communication  has  com- 
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pleted,  similar  to  the  immediate  routines  in  NX/2.  There  is  a  small  but  comprehensive  set  of 
routines  to  test  and  wait  for  completion  of  non-blocking  functions.  This  functionality  allows  the 
user  to  write  programs  which  do  not  incur  the  overhead  of  copying  message  data  into  intermediate 
buffers. 

Collective  communications  are  provided  where  all  processes  in  a  process  group  are  involved  in 
a  collective  operation.  A  collective  function  is  called  as  if  it  contained  a  group  synchronization, 
although  this  property  is  not  mandated  since  efficient  implementations  may  not  synchronize.  The 
following  is  description  of  some  important  collective  communication  routines. 

MPIJBARRIER  Synchronization  of  all  processes  within  a  group. 

MPIJ3CAST  Every  process  within  a  group  receives  data  broadcast  by  a  root  process. 

MPI^GATHER  Every  process  within  a  group  sends  data  to  a  root  which  stores  the  data 
in  rank  order. 

MPI_SCATTER  The  inverse  of  MPLGATHER,  where  a  root  process  sends  sections  of 
data  to  every  process  within  a  group  in  rank  order. 

MPIJREDUCE  Performs  a  parallel  reduction  over  every  process  within  a  group.  The 
operation  is  selected  from  a  set  of  defined  arithmetic  and  logical  operators  or  is  described 
as  a  user  function.  The  output  is  available  to  a  root  process,  every  process,  or  scattered 
over  the  processes. 

MPI  also  contains  collective  routines  for  all-to-all  global  communication,  all-to-all  personal  com¬ 
munication  also  known  as  complete  exchange,  and  inclusive  parallel  prefix  also  known  as  scan. 

3.2  HOPD  STAP  and  its  Computational  Requirements 

The  first  step  of  HOPD  STAP  is  Doppler  processing  which  transforms  a  data  cube  into  the 
Doppler  domain;  M-point  FFT  computations  are  performed  along  delay  dimension  for  each  ele¬ 
ment  and  each  range-gate.  Therefore,  a  total  of  NL  M-point  FFT  computations  are  performed. 
The  next  step  is  weight  computation.  The  transformed  data  is  converted  into  least  square  prob¬ 
lems.  There  are  (M-2)  least  square  problems  to  be  solved.  The  transformed  data  cube  is  divided 
into  the  M  slabs  along  the  range-gate  dimension,  which  are  referred  to  as  Doppler  bins  as  shown 
in  Figure  3.2.  The  underlying  matrix  of  each  least  square  problem,  is  obtained  by  concatenating 
the  three  L  X  N  size  consecutive  Doppler  bins  i  —  and  i  -h  1  in  an  overlapped  fashion.  The 
resulting  matrix  has  a  size  of  i  x  3A.  Each  matrix  is  divided  into  two  parts.  The  first  part  of  the 
size  Lis  X  3N  defines  the  least  square  problem  that  is  solved  by  QR  decomposition.  The  solution 
of  the  least  square  problem  z.e.  the  weight  vector  is  applied  to  the  remaining  part  {L^a)  In  the 
third  step.  Lis  will  typically  vary  from  2{3N)  =  6N  to  5(3A)  =  15N  [11]. 

Figure  3.2(a)  shows  the  typical  value  of  each  dimension  of  a  data  cube  (delay  dimension 
M=64,  element  dimension  A=64,  and  range-gate  dimension  L=4096  (iy/5=1024)  for  a  nominal 
size  problem.  All  these  parameters  can  change  for  a  bigger  size  problem).  In  the  Doppler 
processing  stage,  NL  M-point  FFT  computations  are  performed.  In  the  weight  computation 
stage,  there  are  (M-2)  QR  decomposition  computations  of  size  Lis  X  3N,  It  is  noted  that  only 
Lis=lQN {=1024)  parts  of  range-gates  are  used  for  the  weight  computation,  as  shown  in  Figure 
3.2(b).  The  weight  application  and  detection  stages  have  also  many  independent  problems,  but 
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the  amount  of  computations  is  comparable  to  that  of  Doppler  processing  stage.  The  sequential 
m-point  FFT  problem  requires  mlogm  complex  multiplications  and  additions.  Assuming  one 
complex  multiplication  and  addition  operation  take  5  flops*,  the  sequential  m-point  FFT  takes 
bm  log  m flops.  The  sequential  QR  decomposition  of  a  m  x  n  matrix  needs  n^(m  -  n/3)  flops 
[14].  Therefore,  the  estimated  computational  requirement  of  HOPD  STAP  is  as  follows. 

•  Doppler  Processing:  NL  M-point  FFT  Problems 

{NL)  X  5  X  MlogM  =  480  x  2'^° flops  =  iSOM flops 

•  Weight  Computation:  (M-2)  QR  Decomposition  Problems  of  size  16N  x  3N 

(M  -  2)  X  {{3N)^  X  (16A^  -  3N/3)}  =  2mM flops  »  2G flops 

The  entire  HOPD  STAP  requires  approximately  SG flops.  The  real  time  requirements  vary  by 
the  application.  Suppose  that  the  entire  set  of  steps  is  to  be  computed  in  less  than  1  second,  then 
at  least  54  POWER2  processors  (operating  at  66MHz)  are  needed  for  the  real  time  computation 
of  HPOD  STAP  in  an  ideal  case. 


3.3  Scalable  Parallel  Algorithms  for  HOPD  STAP 

In  this  section,  we  first  discuss  the  characteristics  of  general  HPC  platforms  for  algorithm  devel¬ 
opment  and  scalability  analysis.  Then,  we  develop  techniques  for  data  remapping,  scheduling  and 
overlapping  communication  with  computation  to  obtain  scalable  parallel  algorithms  for  HOPD 
STAP. 

3.3.1  Computational  Model 

For  our  analysis,  we  will  model  the  general  HPC  platforms  as  a  set  of  high-performance  SISD 
machines  interacting  through  a  high-speed  network.  Assuming  MPI  is  used,  interprocessor  com¬ 
munication  is  performed  using  explicit  message  passing  [8].  Let  Tj,  denote  the  startup  time  for 
sending  a  message.  Let  denote  the  transmission  rate  (seconds  per  unit  of  data)  for  data 
communication.  The  startup  time,  including  operating  system  and  communication  protocol  pro¬ 
cessing  overheads,  is  associated  with  each  communication  step.  In  current  generation  of  HPC 
platforms,  the  network  (hardware)  latency  is  very  small  as  compared  with  the  software  overheads 
in  message  passing. 

We  make  the  following  assumptions  for  our  analysis:  (1)  Sending  a  message  containing  m  units 
of  data  from  a  processor  to  another  processor  or  exchanging  a  message  of  size  m  between  a  pair 
of  processors  takes  Td  +  rnTj,  time.  (2)  Suppose  each  processor  has  m  units  of  data  to  be  routed  to 
a  single  destination  and  the  set  of  all  destinations  is  a  permutation,  then  the  data  can  be  routed 
in  Td  -b  ruTd  time.  (3)  To  perform  a  broadcast  of  a  unit  of  data,  a  barrier  synchronization,  or  a 
cooperative  operation  (sum,  min,  max,  prefix  sum  etc.),  2{\ogp)Td  time  is  required.  Broadcasting 
m  units  of  data  takes  logp(Td-\-  TtiTd)  time.  (4)  To  perform  a  floating  point  operation,  it  takes  tc 
time. 

For  currently  available  HPC  platforms,  the  ratio  of  Td  to  td  is  in  the  range  of  several  hundreds 
to  few  thousands  as  shown  Table  3.1^.  To  reduce  the  communication  time,  the  high  startup  cost 
as  well  as  the  message  length,  must  be  considered. 

•C.B.  Moler’s  concept  of  a  flop  is  used.  One  flop  roughly  constitutes  the  effort  of  doing  a  floating  add,  a 
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Table  3.1:  Communication  features  of  various  HPC  platforms. 


Machine 

Tdifisec) 

Td{p-sec/byte) 

Tdjrd 

T3D 

93 

0.043 

2163 

SP-2 

46 

0.035 

1314 

CM-5 

86 

0.12 

716 

Paragon 

82 

0.26 

315 

iPSC/860 

60 

0.50 

120 

iPSC/2 

700 

0.36 

1944 

3.3.2  Scalability  and  Portability 

We  use  a  standard  definition  of  scalability  and  efficiency  [13,  15].  A  parallel  algorithm  is  con¬ 
sidered  scalable  if  the  execution  time  of  the  algorithm  on  a  machine  with  p  processors  varies  as 
i.  Efficiency  is  defined  as  where  Ts  is  the  sequential  execution  time  and  Tp  is  the  parallel 
execution  time.  Our  goal  is  to  design  scalable  algorithms  which  provide  high-speed  execution  on 
available  partition  sizes  of  machines  for  problem  sizes  useful  to  radar  signal  processing  commu¬ 
nity.  In  our  analysis,  we  consider  machine  sizes  that  are  not  “large”,  for  example,  at  this  time, 
the  largest  installed  accessible  coarse  grain  machines  have  less  than  IK  nodes. 

Our  implementations  were  performed  using  the  MPI  standard,  and  therefore  can  be  easily 
ported  to  other  High  Performance  Computing  (HPC)  platforms.  Our  choice  of  MPI  allows  us 
to  access  many  communication  features  provided  by  emerging  HPC  platforms.  At  this  time,  we 
have  implemented  our  algorithms  on  SP-2  and  T3D. 

3.3.3  Data  Remapping,  Scheduling,  and  Overlapping  Computation  with  Com¬ 
munication 

For  a  scalable  performance,  efficient  data  mapping  and  data  distribution  schemes  should  be 
developed  so  that  the  parallelizing  overhead  is  minimal;  these  schemes  should  minimize  the  inter- 
processor  communication  time.  Our  algorithm  exploits  these  schemes  and  is  scalable  in  the  range 
of  p  <  M. 

A.  Data  Remapping 

We  choose  data  mapping  schemes  which  allow  a  processor  to  contain  all  the  necessary  data  for 
computing  each  task.  In  our  approach,  tasks  such  as  FFT  and  QR  decomposition  do  not  need 
interprocessor  communication.  In  the  Doppler  processing  stage,  the  data  cube  is  accessed  along 
the  delay  dimension.  However,  in  the  weight  computation  stage  which  follows  Doppler  processing 
stage,  the  data  cube  is  accessed  along  the  range-gate  dimension  which  is  orthogonal  to  the  delay 
dimension.  Thus,  different  data  mapping  schemes  are  necessary  for  these  computation  steps. 
The  remapping  process  refers  to  the  transformation  of  the  initial  data  mapping  into  another  data 

floating  point  multiply,  and  a  little  subscripting  [14]. 

’Some  results  were  obtained  by  our  own  experiments.  Others  are  from  Techmcal  Report  No.  UCB/CSD92 
#675,  University  of  California,  Berkeley.  It  is  assumed  that  MPI  is  used  for  T3D. 
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mapping.  Our  scalability  analysis  in  the  following  section  shows  that  our  approach  using  data 
remapping  exhibits  scalable  performance  characteristics,  which  allows  each  processor  to  compute 
sequentially  as  much  as  possible  without  interprocessor  communication  by  providing  the  data 
required  in  its  computation. 

Initially,  the  data  cube  is  partitioned  along  the  range-gate  dimension  into  blocks  containing 
Lis/p  snapshots  and  the  processors  are  assigned  to  these  partitioned  data  in  a  cyclic  distribution 
as  shown  in  Figure  3.3(a).  We  call  this  data  mapping  as  snopshot  Tnupping.  Each  processor  has 
'iLi.MN  ALjJl  jVf.point  FFTs  can  be  performed  in  each  processor  without  interprocessor 

cornmunication.  The  other  mapping  is  shown  in  Figure  3.3(b),  which  we  name  as  slab  mapping. 
Here,  the  Lis  part  of  slabs  (Doppler  bins)  is  partitioned  along  the  delay  dimension  among  the 
p  processors.  Note  that  only  Lis  part  of  the  data  cube  is  used  for  weight  computation  and  the 
rest  {Lwa  part)  is  used  for  weight  application.  The  Ly^a  part  of  the  data  cube  is  not  remapped. 
The  size  of  the  data  which  will  be  remapped  in  each  processor,  is  LuMK^  Each  processor  sends 

size  data  to  all  the  processors  and  receives  size  data  from  all  the  other  processors 

as  shown  Figure  3.3(c). 

Each  processor  can  form  (M/p  -  2)  matrices  of  size  Lis  X  3N'  for  computing  the  weight  vectors. 
Therefore,  a  total  of  {Mjp-^)  QR  decomposition  problems  can  be  solved  without  communication 
and  two  more  problems  can  be  solved  by  communicating  with  adjacent  processors.  Exchange  of 
boundary  data  between  the  adjacent  processors  is  needed.  This  is  shown  in  Figure  3.4.  This  data 
exchange  is  performed  during  the  remapping  process. 

B.  Scheduling 

As  mentioned  before,  HOPD  STAP  consists  of  a  sequence  of  tasks.  The  input  to  a  task  is  output 
of  the  previous  task.  Due  to  the  data  dependency  and  different  data  access  patterns  between 
tasks,  interprocessor  communication  is  required.  For  scalable  performance,  a  good  scheduling 
scheme  is  necessary.  The  QR  decomposition  tasks  are  dependent  on  only  Lis  part  of  the  data 
cube  which  were  Doppler  processed.  The  output  of  QR  decomposition  tasks  are  used  along  with 
the  rest  of  the  Doppler  processed  data  for  detection. 

Also,  the  scheduling  of  communication  is  important  to  fully  exploit  the  available  communi¬ 
cation  bandwidth  and  to  avoid  node  contention.  During  the  remapping  process,  every  processor 
distributes  equal  size  data  to  all  the  other  processors.  It  is  a  collective  communication  but  it  can 
be  converted  into  (p-1)  steps  of  all-to-all  personalized  communications  {i.e.  permutations)  by  a 
simple  scheduling.  In  each  of  the  (p-1)  communication  steps,  each  processor  sends  equal  size  data 
to  another  processor  and  receives  equal  size  data  from  another  processor.  This  exploits  the  fact 
that  total  p  point-to-point  communication  can  be  performed  in  parallel  among  p  processors. 

By  employing  a  butterfly-like  communication  pattern,  the  number  of  communication  steps 
can  be  reduced  to  logp  and  the  startup  overhead  can  be  reduced  by  a  factor  of  By  using 

features  such  as  communication  group  and  context  in  MPI,  the  communication  between  processors 
can  be  localized.  However,  in  this  case,  there  is  an  additional  overhead  in  buffer  copying. 

C.  Overlapping  Communication  with  Computation 

We  reduce  the  communication  time  of  data  remapping  by  scheduling  the  communications  and  by 
overlapping  the  communication  with  the  computation.  Using  asynchronous  non-blocking  com¬ 
munication  primitives  provided  by  the  MPI  standard  (such  as  MPI_Isend(),  MPI_Irecv(),  and 
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(a)  Snapshot  Mapping 


(b)  Slab  Mapping 


(c)  Data  to  be  communicated 


Figure  3.3:  Data  remapping  process. 


Ill 


Figure  3.4:  Boundary  data  exchange. 

MPI-WaitallO),  we  can  overlap  the  communication  needed  for  the  remapping  process  with  the 
FFT  computations.  First,  each  processor  performs  sequential  FFT  computations  over  Lis  portion 
of  the  data  cube.  As  soon  as  these  M-point  FFT  computations  are  computed,  the  remapping 
process  can  start.  Asynchronous  non-blocking  commands  MPI_Isend()  and  MPI_Irecv()  return 
immediately  after  initiating  communication  without  waiting  for  completion.  Therefore,  the  rest 
of  the  FFT  computations  over  L^a  part  of  the  data  cube  can  be  performed  by  processors,  while 
the  actual  data  transfer  is  handled  by  communication  coprocessors. 

A  processor  can  solve  one  QR  decomposition  problem,  after  receiving  three  consecutive  slabs 
from  the  remapping  process.  Suppose  the  data  cube  is  partitioned  along  the  delay  dimension  by 
K  such  that  ^  >  3.  The  remapping  process  is  divided  into  K  (p-1)  permutation  steps.  Then, 
the  computation  of  QR  decomposition  can  be  also  overlapped  with  the  communication. 

D.  Scalability  analysis 

For  scalability  analysis,  jfe  assume  that  communication  is  not  overlapped  with  computation.  The 
experimental  results  of  overlapping  the  remapping  process  with  the  computation  are  shown  in 
Section  3.6.  The  remapping  process  consists  of  (p-1)  permutations.  During  each  permutation, 
every  processor  sends  and  receives  size  data.  Therefore,  the  communication  time  for  the 

remapping  process  is 

Tremap  =  {P  -  ^) {Td  +  k  ■  ■  Td}  (3.1) 

where  k  is  the  size  of  a  data  point  in  bytes.  Boundary  data  exchanges  involve  two  additional 
communications  with  message  size  of  k  •  LisN  •  r^.  The  communication  time  for  boundary  data 
exchange  becomes  2{Td  +  k  ■  LisN  •  tj,}-  Also,  for  the  weight  application,  the  solutions  of  QR 


112 


decompositions,  i.e.  the  weight  vectors  must  be  permuted  among  all  the  processors.  This  involves 
(p  -  1)  all-to-all  communications  with  message  size  of  •  y.  It  takes  (p  -  l){rd  +  /?  •  y  • 

Td}  communication  time.  Because  L^N  <  and  y  <  ^y—  when  p  <  M,  the  total 

communication  time  of  our  approach  is  upper  bounded  by 

Tcomm  <  2p-Td  +  k-  .  Td  (3.2) 

p 

It  is  noted  that  2p-Td<.k  ■  ■  Td,  for  the  values  taken  by  p  (<  M),  Lu  (=  L/4),  M,  and 

N  in  typical  STAP  applications.  For  example,  M=64,  N=64,  Tis=1024,  and  p=32,  the  startup 
overhead  2p  •  Td=2.9  msec  which  is  much  smaller  than  k  •  .  7^=110  msec  for  the  SP-2. 

Therefore,  Tcomm  can  be  approximated  as 


Tcomm 


,  3iMA 

w  k - - - Td 

4p 


(3.3) 


In  summary,  the  total  startup  overheads  are  2p-Td  and  total  amount  of  data  to  be  communicated 
is  approximately  k  •  •  Td  when  data  remapping  is  used.  If  data  remapping  is  not  used,  then 

6(M-  2)N  communication  steps  are  required  to  solve  (M-2)  X  3N  size  QR  decompositions. 
(Solving  one  m  x  n  size  QR  decomposition  requires  2n  communication  steps.)  In  this  case,  the 
startup  overhead  will  be  6(M  —  2)N  •  Td-  Thus,  remapping  which  eliminates  frequent  short 
message  communications,  reduces  startup  overheads.  Also,  it  reduces  total  amount  of  message 
to  be  communicated.  This  will  be  analyzed  in  detail  in  the  Section  3.5. 

The  sequential  execution  time  Tg  consists  of  time  for  executing  LN  M-point  FFTs,  (M-2) 
Lis  X  3N  size  QR  decompositions,  weight  application,  and  detection.  Weight  application  and 
the  rest  of  the  computations  including  detection  are  not  computationally  demanding.  Thus,  Ts 
becomes 


Ts  =  {9{M  -2)N^{L/4  +  N)  +  10LMN\ogM}  -tc  (3.4) 

Due  to  the  remapping  of  data,  the  load  on  the  processors  is  well  balanced.  In  other  words, 
the  FFT  and  QR  decomposition  computations  are  equally  distributed  among  the  processors. 
Therefore,  the  parallel  execution  time  is 

T 

Tp  =  —  +  Tcomm 

P 

=  {9{M  -2)N^{L/4  +  N)  +  lOLMN\ogM}  ‘-^-k  -  Td  (3.5) 

p  Ap 

It  should  be  noted  that  Tcomm  ^  In  other  words,  the  computation  time  (^)  of  each  processor 
is  much  larger  than  the  communication  time  {Tcomm) ^  Thus,  the  proposed  algorithm  is  scalable 
in  the  sense  that  the  execution  time  of  the  algorithm  on  a  machine  with  p  processors  varies  as  ^ 
in  the  range  of  p  <  M.  Efficiency  becomes 

T  T  1 

E  =  -^=  .r  - r  = - T -  (3*6) 

P-Tp  p{^  + Tcomm)  1+%^ 

Because  Tcomm  ^  near  ideal  efficiency  is  achieved. 
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3.4  Comparison  with  previous  approaches 

A  previous  approach  for  HOPD  processing  uses  row  wrap  mapping  scheme  in  order  to  avoid  data 
redistribution  (remapping)  [11].  In  their  approach,  FFT  computations  can  be  performed  sequen¬ 
tially  without  inter-processor  communication.  However,  parallel  QR  decomposition  is  needed  to 
exploit  their  row  wrap  mapping  scheme.  The  parallel  QR  decomposition  algorithm  developed  by 
Whitman  et  al.  [12]  was  used  in  [11].  In  [11],  QR  decomposition  comprises  most  of  the  computa¬ 
tion  as  well  as  communication  time.  Their  experimental  results  show  that  its  communication  time 
increases  rapidly  as  the  number  of  processors  is  increased.  When  a  32  node  SP-1  is  employed, 
the  percentage  of  communication  time  is  more  than  50%  of  the  total  execution  time. 

The  parallel  QR  decomposition  algorithm  of  [12]  and  its  timing  diagram  for  the  first  two  steps 
are  shown  in  Figure  3.5  and  Figure  3.6.  Processors  are  idle  between  PASS  and  SCALE  steps  and 
the  processors  communicate  during  PASS  and  BRDCAST  steps.  The  communication  pattern 
is  collective  for  PASS  and  is  broadcast  for  BRDCAST.  PEO  (which  contains  the  main  diagonal 
element  an),  can  start  SCALE  step  only  after  it  has  received  the  accumulated  dot  products  from 
the  other  PE’s.  During  the  SCALE  step,  all  the  processors  except  PEO  remain  idle.  Also,  after 
receiving  the  broadcast  scale  factors  from  PEO,  all  the  processors  can  perform  UPDATE.  The 
estimated  computation  and  the  communication  times  are 

tcomp  =  {{n^ +  2n)-  — +  +  hn) -tc  (3.7) 

"P 

71^ 

tcomm  =  (p  +  logp-l){ra-rd-t-fc  -(-^-|- Y)-rd}  (3.8) 

where  m  and  n  {m  >  n)  represent  the  row  and  column  size  of  the  matrix,  tc  is  the  time  for 
performing  one  floating  point  operation,  and  Td  and  are  startup  time  and  per-word  transfer 
time,  respectively.  From  Equation  (3.8),  the  communication  time  increases  in  proportion  top,  the 
number  of  processors.  There  is  another  parallelizing  overhead:  there  exists  sequential  bottleneck 
where  only  one  processor  is  active  and  all  the  others  are  idling.  The  idle  time  of  SCALE  step  is 
estimated  in  Equation  (3.9)  and  during  this  time,  {p-l)tidie  useful  work  could  be  done  otherwise. 

tidie  ^  —(n‘^-{-7n)'tc  (3-9) 

To  solve  one  QR  decomposition  problem,  there  are  n  communication  steps  and  each  step  i 
involves  collect  and  broadcast  communications  using  message  of  size  (n-i-l-1),  where  n  represents 
the  column  size  of  the  matrix  involved.  Therefore,  a  total  of  2(n-z  +  l)  =  n{n+l)  messages 
are  communicated.  Because  there  are  {M-‘l)  QR  decomposition  problems  of  size  Lis  X  ZN  in  the 
HOPD  STAP,  the  total  size  of  the  messages  to  be  communicated  is  ZN{ZN  -\-  1)(M  —  2).  This 
is  almost  \-th  of  the  entire  data  cube  size  we  are  considering.  By  assuming  that  all  these  data 
can  be  sent  in  two  steps  [i.e.  collect  and  broadcast  communications)  by  ignoring  communication 
startup  overheads  and  idle  times,  a  lower  bound  on  the  communication  time  of  the  algorithm  in 
[3]  can  be  obtained.  Using  the  time  for  point-to-point  communication  time  as  {Td  +  m-  Td)  and 
the  time  for  broadcasting  as  (T^  -f  m  •  Td)  logp,  the  communication  time  '^comm  in  [11]  is  lower 
bounded  as 

Zomn.  >  {p  +  logp-l){Td  +  k-^N{ZN+l)iM-2)-Td}  (3.10) 

«  {p  +  log p){k^MN^)  ■  Td  (3-11) 
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The  i-th  step  of  the  parallel  QR  decomposition  based  on  House¬ 
holder  transformation  (i  =  where  n  is  the  column  size  of 

the  underlying  matrix).: 


=  a^  +  y^e^,  y.  =  sign  {a ^ 


/  2  2 


“  K+T,)  -X’  ■' 

(1)  DOTPRD[/]:  Compute  the  dot  products  of  the  local  block  of 

column  i  with  columns  Only  use  elements  that  are  in 

matrix  rows  i  through  m. 

(2)  ACCUM[i]:  Wait  for  the  n-i+l  dot  products  from  a  successor 
processor,  if  there  is  one,  and  add  on  the  local  dot  products. 

(3)  PASS[i]:  Send  the  revised  n-i+l  dot  products  to  a  preceding 
processor,  if  there  is  one. 

(4)  SCALE[i]:  The  processor  which  contains  an  is  the  last  to 
receive  the  revised  dot  products.  It  performs  a  square  root  opera¬ 
tion  to  determine  the  scale  factor  g^-,  adds  it  to  the  main  diagonal 
element,  corrects  the  inner  products  for  the  addition  of  g,-  to  the  i- 
th  element  of  Wp  and  multiplies  them  by  pn 

(5)  BRDCAST[/]:  Wait  for  the  n-i+l  scale  factors  from  a  preced¬ 
ing  processor,  if  there  is  one,  and  pass  them  on  to  a  successor 
processor,  if  there  is  one. 

(6)  UPDATE[/]:  Update  the  local  block  of  column  i  through  n  by 
subtracting  multiples  of  the  local  block  of  column  i. 


A^ll-n.u.u. 
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PEO  PEI  PE2  PE3 


PEO  PEI  PE2  PE3 


(4PEs  compute) 


•  000 

( 1  PE  computes) 


•  •  •  • 

(4PEs  compute) 


•  000 

(1  PE  computes) 


Computation  ■  DOTPRD  ■  ACCUM,  UPDATE  ■  SCALE 

Communication  ■  PASS,  BRDCAST 
Idle  □ 


Figure  3.6:  The  timing  diagram  of  the  parallel  QR  decomposition  algorithm  for  the  first  two 
steps.  Four  processors  are  assumed  for  the  sake  of  illustration. 
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In  other  words,  actual  communication  time  of  [11]  will  take  at  least  (p  +  logp)(k^MN'^)  •  r^. 
Consider  the  case  of  M=64,  N=64,  Lis=1024,  k=8,  and  p=S2,  then  the  communication  time 
in  the  our  algorithm  is  Tcomm  ^  S84Kword  ■  and  the  communication  time  in  [11]  becomes 
'^'comm  ^  40000 A'word  •  This  explains  why  the  efficiency  of  the  algorithm  in  [11]  drops 
drastically  as  the  number  of  processors  is  increased.  It  should  be  also  noted  that  there  is  significant 
processor  idle  time  as  shown  in  Equation  (3.9).  In  contrast,  the  communication  time  of  the 
proposed  algorithm  is  inversely  proportional  to  p  as  shown  in  Equation  (3.3).  If  more  processors 
are  used  in  the  range  p  <  M,  the  communication  time  becomes  less.  Figure  3.7  shows  the 
estimated  communication  time  of  the  proposed  algorithm  and  the  communication  time  in  [11] 
for  the  SP-2.  As  shown  in  Figure  3.7,  Tcomm  <  T^omm-  also  observed  that  Tcomm  decreases 
while  T^omm  increases  with  increasing  number  of  processors. 

ScaLAPACK  [7]  also  provides  QR  decomposition  routine.  It  uses  BLAS (Basic  Linear  Algebra 
Subroutine),  PBLAS  (Parallel  extension  of  BLAS),  and  BLACS (Basic  Linear  Algebra  Commu¬ 
nication  Subprogram).  Because  ScaLAPACK  employs  block  cyclic  data  mapping,  data  redis¬ 
tribution  is  necessary  between  EFT  computations  and  QR  decompositions.  The  efficiency  of 
ScaLAPACK  drops  significantly  unless  large  size  matrices  are  used.  For  example,  it  can  deliver 
about  sustained  21  GFLOPS  on  16  X  32  node  Paragon  for  the  matrix  size  of  35000  X  35000. 
However,  if  the  matrix  size  becomes  less  than  1000  x  1000,  then  less  than  1  GFLOPS  can  be 
achieved  on  Paragon.  This  means  that  the  parallelization  overheads,  i.e.,  communication  and 
idle  times  dominate  the  computation  time.  Because  the  typical  matrix  size  for  HOPD  STAP  is 
IQN  X  5N  (1024  X  320  when  ^=64),  we  cannot  expect  high  efficiency  by  using  ScaLAPACK. 


3.5  Experimental  Results 

We  have  performed  our  implementations  using  C  and  MPI^  on  the  SP-2  at  Maui  High  Perfor¬ 
mance  Computing  Center  and  on  the  T3D  at  the  Pittsburgh  Supercomputing  Center.  Optimized 
versions  of  code  provided  by  ESSL  (Engineering  Scientific  Subroutine  Library)  were  used  in  our 
SP-2  implementations. 

Remapping  process  can  be  performed  either  by  using  asynchronous  non-blocking  communi¬ 
cation  or  by  using  synchronous  blocking  communication  as  shown  in  Figure  3.8.  Asynchronous 
non-blocking  communication  allows  the  communication  in  the  remapping  process  to  be  overlapped 
with  Lwa  parts  of  FFT  computations.  In  this  case,  the  actual  time  spent  in  communication  ac¬ 
tivity  becomes  tis{i)+tir{i)  +  twt{i)^  where  tis{i),  and  are  time  spent  in  the  i-th 

MPI-Isend,  MPI_Irecv,  and  MPI.Waitall,  respectively.  In  the  case  of  non-blocking  communica¬ 
tion,  communication  time  becomes  Gr(0>  where  Gr(*)  represents  time  that  takes  in  the  ?-th 
MPI^endrecv.  Table  3.2  shows  the  time  results  of  FFT  and  data  remapping  process.  It  should 
be  noted  that  message  size  to  be  communicated  decreases  as  the  number  of  processors  increases. 
Thus,  the  communication  time  of  remapping  process  decreases  as  the  number  of  processors  in¬ 
creases  up  to  M.  Figure  3.9  illustrates  the  reduction  in  the  remapping  time  by  overlapping  the 
remapping  process  with  the  computation.  On  both  machines,  the  overlapped  remapping  process 
takes  less  than  37%  of  the  non-overlapped  one. 

SP-2  was  configured  in  the  dedicated  mode  to  obtain  the  results  of  the  HOPD  STAP  imple¬ 
mentation  shown  in  Table  3.3.  For  the  sake  of  comparison,  an  earlier  implementation  [11]  of  the 
same  technique  using  the  same  parameters  runs  in  about  30  seconds  on  a  32  node  SP-1. 

*MPICH  version  was  used. 


117 


Figure  3.7:  Estimated  Toomm  and  on  SP-2  for  M=64,  iV=64,  L;,=1024.  Each  data  is 

assumed  as  single  precision  complex  number  (8  bytes). 
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(LjgN/p)  M-point  FFT's 

[L^gN/p^)  M-point  FFT's 

for  (i=l;  i<p;  ++i)  { 

.  .  .  _ 

hsd) 

hrd) 

MPI_Isend(i)  ;  ^ 

} 

MPI_Irecv(i)  ;  ^ 

M-poinJj.  FFT' 

S 

MPI  Waitall ( i )  ;  ^ 

twtd) 

} 

QR 

Decompositions 

(a)  Asynchronous  Nonblocking  communication 


{LigN/p)  M-point  FFT's 
(L^gN/p)  M-point  FFT's 
for  (i=l;  i<p;  ++i)  { 

MPI_Sendrecv(i) ; 

tsrd) 

} 

QR  Decompositions 

(b)  Synchronous  Blocking  communication 

Figure  3.8:  Asynchronous  Nonblocking  communication  v.s.  Synchronous  Blocking  communica¬ 
tion  for  Remapping. 
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Table  3.2:  Timing  Results  for  computing  FFT  and  Data  Remapping  on  T3D  and  SP2. 


Timing  Results  for  Performing  FFT&  REMAPPING  on  T3D  (msec)® 


p»4 

p=8 

p=:=32- 

iiilHii 

Doppler 

Number  of  64-point  FTFT’s  per  PE 

65,536 

32,768 

16,384 

8,192 

4,096 

Proc 

Total  time  for  FFTs 

12800.0 

6420.0 

3210.0 

1610.0 

200.0 

Message  size 

2MB 

512KB 

3i28KB 

32KB 

SICB 

Data 

(p-1)  Async  Nonblocking  pt-to-pt  communications 

63.5 

37.6 

21.8 

15.1 

11.8 

Remap 

(/7-1)  Synchronous  blocking  sendrecv  communications 

264.0 

154.0 

84.5 

48.1  1 

32.0 

a.  Cray  Library  was  not  available.  Numerical  Recipies  routine  was  used. 


Timing  Results  for  Performing  FFT&  REMAPPING  on  SP2  (msec)>> 


Number  of  PBS 

piSi4  ■ 

iiip-16 

pa32 

p«64 

Doppler 

Proc 

Number  of  64-point  FFT’s  per  PE 

65,536 

32,768 

16,384 

8,192 

Total  time  for  FFTs 

1730.0 

858.0 

428.0 

214.0 

Message  size 

f  '2MB 

||ll28KB 

•  32KB 

8KB 

Data 

Remap 

(p-\)  Async  Nonblocking  pt-to-pt  communications 

64.7 

37.5 

13.2 

8.57 

6.71 

(pA)  Synchronous  blocking  sendrecv  communications 

351.0 

220.0 

101.09 

43.99 

29.8 

b.  IBM  ESSL  library  was  used  for  computing  FFT.  MPICH  was  used.  Loadleveller  data. 


Table  3.3:  Total  Execution  Times  on  SP-2  for  M=N=64,  L— 4096. 


No.  of  PE’s 

p  =  1 

p  =  4 

p  =  8 

p  =  16 

p  =  32 

p  =  64 

SP-2 

55sec 

15  sec 

7.8  sec 

3.9  sec 

2.0  sec 

0.96  sec 
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Figure  3.9:  Experimental  results  of  data  remapping  with  overlapping  for  M=64,  A^=64,  L/j=1024 
on  SP-2  and  T3D.  Each  data  is  assumed  as  a  single  precision  complex  number  (8  bytes). 
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3.6  A  Methodology  for  the  Design  of  Scalable  Solutions 

Radar  signal  processing  is  typical  of  the  applications  that  arise  in  embedded  real-time  environ¬ 
ments.  In  this  section,  we  highlight  a  coarse-grained  parallel  approach  to  Space  Time  Adaptive 
Processing  (STAP),  a  class  of  radar  signal  processing.  We  first  describe  a  model  of  the  target 
HPC  platforms.  We  then  perform  scalability  analysis  on  STAP  algorithms,  and  demonstrate  the 
design  of  efficient  parallel  algorithms  for  these  tasks. 

Our  approach  to  design  efficient  solutions  for  real-time  applications  on  the  HPC  platforms 
consists  of  the  following  steps: 

•  Definition  of  a  realistic  computational  model  of  the  HPC  systems.  The  model  represents 
the  salient  features  of  the  underlying  hardware  architecture. 

•  Analysis  of  the  application  task  to  be  performed,  and  identifying  the  phases  that  are  the 
most  time  consuming. 

•  Quantification  of  the  overheads  incurred  in  executing  the  task  on  the  architecture,  with 
the  assistance  of  the  model.  The  overheads  and  the  computation  time  are  calculated  as  a 
function  of  the  number  of  processors. 

These  functions  are  then  used  to  study  the  scalability  of  the  application-architecture  pair.  If 
the  total  execution  time  (overheads  +  computation  time)  is  inversely  proportional  to  the  number 
of  processors,  then  the  solution  is  said  to  be  scalable. 

For  our  analysis  of  STAP  tasks,  we  define  a  General  purpose  Distributed  Memory  (GDM) 
model.  This  model  balances  simplicity,  accuracy  and  generality.  We  model  the  parallel  system  as 
a  set  of  processors,  interacting  through  a  high  speed  network.  Most  HPC  systems  have  a  uniform 
high-level  architecture  of  this  form. 

The  communication  operation  in  such  a  system  can  be  modeled  with  two  parameters.  One  is 
startup  time  which  occurs  in  every  message  transfer.  This  includes  software  and  communication 
protocol  processing  overheads.  The  other  is  unit  transmission  time  which  is  the  cost  of  trans¬ 
ferring  a  message  of  unit  length.  Let  Td  denote  the  startup  time  and  Td  denote  the  transmission 
time.  Then,  a  communication  operation  for  sending  a  message  containing  m  units  of  data  from 
a  module  to  another  module  takes  Td  +  mvd  time.  This  is  the  case  when  “blocking”  send  is  used 
and  no  special  hardware  for  communication  is  available.  During  Td  +  mrd  time,  the  sending  pro¬ 
cessor  is  busy  handling  the  communication.  However,  if  there  is  a  communication  processor  and 
nonblocking  send  is  used,  the  sending  processor  has  to  spend  only  the  startup  time  to  activate  the 
communication  processor.  After  this  initial  startup  time;  the  processor  is  free  and  can  continue 
its  computations,  while  actual  transfer  of  messages  is  handled  by  the  communication  processor. 

The  GDM  model  accurately  represents  the  communication  features  of  many  HPC  architec¬ 
tures.  The  model  also  represents  features  for  overlapping  computation  with  communication  and 
latency  hiding(for  example,  this  feature  represents  the  DMA  access  capability  of  the  LANai  net¬ 
work  interface  chip  in  Myricom  systems  [16]).  In  this  case,  a  communication  operation  is  assumed 
to  take  only  Td  time. 

Since  many  different  platforms  are  under  development,  architecture  independent  algorithms 
and  portability  of  the  developed  code  are  very  important.  We  specify  our  algorithms  using  the 
MPI  standard  [17]  to  ensure  portability  of  our  code. 

In  general,  the  data  layout  problem  is  to  design  a  distribution  of  data  among  the  processors 
such  that  the  overall  communication  time  is  minimized,  and  the  workload  is  balanced.  In  typical 
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signal  processing  applications,  the  data  is  accessed  and  processed  in  different  ways  in  each  sub¬ 
problem.  The  goal  of  data  remapping  is  to  rearrange  the  data  between  the  steps  of  the  algorithm 
so  as  to  reduce  the  communication  time.  In  some  scenarios,  the  overhead  incurred  in  performing 
the  remapping  operation  might  be  larger  than  the  benefit  due  to  remapping.  To  estimate  the 
tradeoffs,  we  use  the  model  to  evaluate  the  benefits  and  overheads  due  to  data  remapping. 

Most  signal  processing  problems  have  regular  and  known  computation  and  communication 
characteristics.  This  permits  mapping  and  scheduling  of  computations  to  be  performed  at  compile 
time.  Indeed,  this  feature  has  been  exploited  in  systolic  arrays  to  transform  the  computations 
into  a  form  that  can  be  executed  with  localized  communication.  The  scheduling  problem  for  HPC 
platforms  is  different  from  that  of  custom  VLSI  systems  on  account  of  the  larger  granularity  of 
the  computations,  large  overheads  in  performing  communication,  and  the  capability  to  overlap 
computation  with  communication. 


3.7  Computational  Characteristics  of  Element-Space  STAP 

STAP  consists  of  an  antenna  array  (with  N  elements)  that  transmits  pulses  and  then  receives 
the  reflected  pulses.  Each  element  accumulates  data  in  this  manner  at  L  different  ranges.  Thus 
the  total  data  collected  in  one  PRI  (Pulse  Repetition  Interval)  has  NL  data  samples.  After  M 
PRPs,  the  accumulated  data  constitute  a  Coherent  Processing  Interval  (CPI)  data  cube.  Figure 
3.10  (a)  shows  the  data  arrival  sequence,  while  Figure  3.10  (b)  shows  the  data  cube. 

A  significant  portion  of  STAP  involves  matrix  algebra  computations.  STAP  consists  of  three 
major  stages  -  training  strategy,  weight  computation,  and  weight  application.  Generally,  weight 
computation  requires  the  solution  of  a  linear  system  of  equations.  This  step  is  a  computation 
intensive  portion  of  the  space  time  processor.  Due  to  enormous  computational  requirements 
of  fully  adaptive  STAP,  many  partially  adaptive  processing  techniques  are  developed.  In  this 
report,  we  are  considering  on  ESPrD  (Element-Space  Pre-Doppler)  and  ESPsD  (Element-Space 
Post-Doppler)  processing  techniques,  which  are  both  partially  adaptive  STAP  approaches. 

We  parallelize  weight  computation  steps  of  ESPrD  and  ESPsD,  which  are  the  major  computa¬ 
tional  bottlenecks  in  the  entire  processing.  In  the  ESPrD,  M'  =  M  -  K  +  1{K  =  2  ~  3)  adaptive 
processing  are  performed  over  subsets  of  size  N  x  K  of  &  snapshot,  as  shown  in  Figure  3.11.  The 
data  cube  is  accessed  along  element  dimension  during  adaptive  processing.  These  subsets  are 
formed  in  an  overlapped  fashion  along  PRI  dimension.  Each  adaptation  involves  KN  x  KN  di¬ 
mension  linear  equations  which  are  to  be  solved  by  QR  decompositions.  After  weight  vectors  are 
applied,  Doppler  processing  (one  M'-point  FFT)  is  performed  for  a  given  snapshot.  The  weight 
computation  steps  of  ESPsD  are  as  follows:  first,  Doppler  processing  is  performed  separately  on 
each  element.  This  requires  N  Af-point  FFT’s  along  PRI  dimension.  Then,  a  different  adaptive 
problem  is  solved  for  each  Doppler  bin,  utilizing  all  elements.  M  QR  decompositions  of  size 
N  X  N  are  performed  and  data  cube  is  accessed  along  element  dimension,  as  shown  in  Figure 
3.12.  Figure  3.13  shows  the  computational  requirements  of  ESPrD  and  ESPsD. 

For  the  range  of  16  <  M,iV  <  128,  the  estimated  flop  counts  are  shown  in  Figure  3.14,3.15, 
3.16,3.17.  It  can  be  easily  observed  that  the  flop  count  grows  exponentially  as  N  grows.  This 
results  from  the  fact  that  performing  QR  decomposition  dominates  other  operations.  Thus,  the 
computational  complexities  of  ESPrD  and  ESPsD  are  both  0{LMN^). 
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Element 


(a)  Sample  arrival  sequence 


PRI 


(b)  Input  data  to  STAP 


Figure  3.10:  The  input  data  cube  and  sample  arrival  sequence 
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ESPrD 


ESPsD 


•  Adaptive  Processing: 

(LM'  QR  decompositions  of  size  KNxKN) 
2LM'(KNf/3  flops 

•  Doppler  Processing: 

(L  M'-point  FFT’s) 

5LM'\ogM'  flops 


•  Doppler  Processing: 

{LN  M-point  FFT’s) 

SLNMlogM  flops 

•  Adaptive  Processing: 

(LM  QR  decompositions  of  size  NxN) 

2LMN^/3  flops 


Figure  3.13:  Computational  Complexities  for  ESPrD  and  ESPsD 


ESPrD  (L=512) 


Figure  3.14:  Computational  Requirements  of  ESPrD  v:^hen  L  —  512 
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flops 


ESPsD  (U512) 


Figure  3.15:  Computational  Requirements  of  ESPsD  when  L  =  512 


ESPrD  (L=4096) 


Figure  3.16:  Computational  Requirements  of  ESPrD  when  L  =  4096 


ESPsD  (L=4096) 


Figure  3.17:  Computational  Requirements  of  ESPsD  when  L  —  4096 

3.8  Scalable  Portable  Algorithms  for  ESPrD  and  ESPsD 

In  this  section,  we  illustrate  the  design  of  scalable  parallel  algorithms  for  ESPrD  and  ESPsD.  Let 
Xnml  be  the  sample  signal  from  the  n-th  element,  m-th  pulse,  and  at  the  Lth  range-gate.  The 
N  antenna  elements  independently  transmit/receive  signals  and  compose  part  of  the  data  cube. 
In  implementing  the  STAP  tasks  on  a  parallel  HPC  platform,  we  assume  that  N  I/O  processors 
(eo,...,eiv-i)  collect  the  data  from  each  element  and  distribute  them  to  P  computing  proces¬ 
sors  {po, . .  .,pp_i).  The  distribution  of  the  data  onto  the  processors  affects  the  communication 
requirements,  the  degree  of  concurrency,  and  the  load  on  the  processors.  Our  goal  is  to  design 
efficient  algorithms  which  can  provide  high-speed  execution  on  available  machine  sizes  and  real¬ 
istic  problem  sizes.  For  scalable  performance,  efficient  data  layout  and  data  distribution  schemes 
should  be  developed  so  that  the  parallelizing  overhead  is  minimal;  these  schemes  should  minimize 
the  inter-processor  communication  time.  We  parallelize  weight  computation  steps  of  ESPrD  and 
ESPsD  at  the  task  level.  This  means  that  individual  tasks  (FFT  and  QR  decomposition)  are 
performed  sequentially  on  a  single  processor.  However,  parallelism  is  achieved  by  distributing 
whole  tasks  among  the  processing  elements.  The  reason  for  employing  task  level  parallelism  is 
that  the  problem  sizes  of  the  tasks  in  ESPrD  are  too  small  to  parallelize  efficiently.  Our  previ¬ 
ous  experience  [9]  shows  that  the  cumulative  communication  overheads  incurred  in  parallel  FFT 
and  QR  decomposition  algorithms  (such  as  ScaLAPACK  routines  [7]),  prevents  scalable  perfor¬ 
mance.  We  choose  data  mapping  schemes,  so  that  each  node  pi  has  all  the  data  necessary  for  its 
computation. 

In  this  data  mapping  scheme,  the  communication  occurs  when  N  I/O  processing  nodes 
distribute  their  data  to  P  computing  nodes.  This  distribution  constitutes  an  all-to-all  communi¬ 
cation  pattern.  It  should  be  noted  that  the  number  of  computing  nodes  (P)  is  much  larger  than 
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the  number  of  I/O  processing  nodes  (N),  i.e.  P  N .  This  is  because  a  large  number  of  com¬ 
puting  nodes  are  necessary  to  deal  with  the  enormous  computational  requirements  for  real-time 
needs.  We  develop  efficient  distribution  algorithms  when  a  small  set  of  processors  distribute  their 
data  equally  to  a  large  set  of  processors.  This  happens  frequently  in  the  general  real-time  signal 
processing  applications,  because  the  number  of  sensors  is  limited  while  the  number  of  process¬ 
ing  nodes  required  for  real-time  needs  is  much  greater.  Also,  when  exploiting  task  parallelism, 
compute  intensive  tasks  must  be  distributed  among  many  processors  to  alleviate  computational 
bottlenecks. 

A  simple  scheduling  scheme  consists  of  P  steps  of  communication  with  messages  of  size  ML/P 
as  shown  Figure  3.18.  Using  the  GDM  model,  this  communication  needs  Tcommi  time,  where 

Tcommi  —  P  'TdP  ML  •  Td  (3.12) 

This  is  because  at  most  N  communications  can  be  performed  in  parallel  due  to  node  contention, 
while  there  are  NP  communications  to  be  performed.  Thus,  the  cumulative  start  up  overheads 
become  large  as  P  grows.  Figure  3.19  shows  communication  scheduling  which  reduces  cumulative 
start  up  overheads.  It  partitions  P  computing  nodes  into  K  =  ^  groups.  Each  group  has  N 
computing  nodes.  Thus,  each  I/O  processing  node  can  distribute  ^  size  data  to  first  group,  then 
second  group,  and  so  on.  This  distribution  takes  Td  -t-  As  soon  as  all  processing  nodes 

in  a  group  receive  data  from  I/O  processing  nodes,  the  remapping  operation  can  be  started. 
The  timing  diagram  shown  in  Figure  3.20(a)  illustrates  this  pipelined  communication  scheduling. 
Because  the  remapping  within  a  group  takes  {N  -  l){Td  -h  ^Td},  total  communication  time 
becomes 

+  (3.13) 

(— +  -  l)rci  +  MZ.(1  H - —  )Ti{  (3.14) 

In  ESPsD,  there  are  many  FFT  computations  to  be  performed  after  the  first  distribution.  There¬ 
fore,  we  can  overlap  the  remapping  process  with  FFT  computations  as  shown  in  Figure  3.20(b). 
Also,  separate  communicators  for  distribution  and  remapping  processes  can  be  used,  which  MPI 
provides  to  avoid  unnecessary  synchronization  during  message  passing.  Figure  3.21  and  Figure 
3.22  show  the  estimated  communication  time  on  SP2  and  T3D  for  various  values  of  L,  M,  and  N. 


3.9  Conclusion 

In  this  work,  we  have  developed  scalable  portable  algorithms  for  partially  adaptive  STAP.  The 
main  contributions  are  in  data  remapping,  scheduling  and  overlapping  computation  with  com¬ 
munication.  Our  techniques  result  in  uniform  distribution  of  load  on  the  processors  as  well  as 
in  minimization  of  the  communication  time.  Earlier  implementations  are  not  scalable  and  the 
communication  overheads  dominate  the  overall  execution  time  for  a  large  number  of  processors. 
The  implementations  were  performed  using  MPI  and  can  be  easily  ported  to  other  machines,  in¬ 
cluding  emerging  HPC  platforms  for  embedded  systems  applications  such  as  the  Martin  Marietta 
Lab’s  High  Performance  Scalable  Computing  System  [3]. 

We  have  also  parallelized  weight  computation  steps  of  ESPrD  (Element  Space  Pre-Doppler) 
and  ESPsD  (Element  Space  Post-Doppler)  STAP.  We  have  developed  efficient  distribution  algo¬ 
rithms  when  a  small  set  of  processors  distribute  their  data  equally  to  a  large  set  of  processors.  We 
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P  computing  nodes  {P-KN) 


GroupO 


(a)  Timing  diagram 


Group  i 


(DFFTl 

(D  Overlap  Remap  1  with  FFT2 
(D  Overlap  Remap2  with  FFT3 
®  Overalp  Remap3  with  FFT4 
<D  Remap4 


1  &  Remap  1 
FFT2  &  Remap2 
FFT3  &  Remap3 
FFT4  &  Remap4 


(b)  Overlapping  communications  with  computations 


Figure  3.20:  Timing  diagram  and  overlapped  communicaiton  with  computation 
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Time  in  msec 


Communication  Times  on  SP-2 


Number  of  Processors 


Figure  3.21:  Estimated  communication  time  on  SP2  (o:  Pipelined  schedule,  x;  Simple  Schedule) 
for  N,M,L  =  (64,64,2048),  (32,32,1024),  (16,16,512). 
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Communication  Times  on  T3D 


Figure  3.22:  Estimated  communication  time  on  T3D  (o:  Pipelined  schedule,  x:  Simple  Schedule) 
for  N,M,L  =  (64,64,2048),  (32,32,1024),  (16,16,512). 
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have  developed  efficient  distribution  algorithms  by  using  data  remapping  and  processor  groups, 
which  will  therefore  reduce  the  effect  of  node  contention  significantly.  Although  we  have  focused 
on  two  variations  of  STAP,  the  techniques  of  communication  scheduling  and  overlapping  com¬ 
putation  with  communication  are  adequately  general.  They  can  therefore  be  applied  to  several 
other  signal  processing  computations  to  yield  efficient  implementations. 

Our  algorithms  can  also  be  easily  ported  to  shared  memory  systems.  The  Prefetch  command 
can  be  used  for  latency  hiding,  thus,  the  computation  and  communication  can  be  overlapped.  It 
is  easy  to  perform  a  similar  analysis  as  shown  in  this  section.  For  example,  on  shared  memory 
machine  models,  one  can  easily  perform  the  analysis  using  the  Block  Distributed  Memory  model 

[9]. 

Emerging  HPC  platforms  provide  many  features  for  communication  that  can  be  accessed  via 
standards  such  as  MPI.  Algorithmic  techniques  can  exploit  these  to  lead  to  scalable  portable  algo¬ 
rithms  for  Linear  Algebra  problems  that  offer  superior  performance  compared  with  ScaLAPACK 
for  the  size  of  problems  encountered  by  the  signal  processing  and  ATR  communities. 
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